458 posts tagged with "ai-engineering"

Conway's Law for AI Systems: Your Org Chart Is Already Your Agent Architecture

· 9 min read
Tian Pan
Software Engineer

Every company shipping multi-agent systems eventually discovers the same uncomfortable truth: their agents don't reflect their architecture diagrams. They reflect their org charts.

The agent that handles customer onboarding doesn't coordinate well with the agent that manages billing — not because of a technical limitation, but because the teams that built them don't talk to each other either.

Conway's Law — the observation that systems mirror the communication structures of the organizations that build them — is more than fifty years old and has never been more relevant. In the era of agentic AI, the law doesn't just apply. It intensifies.

When your "system" is a network of autonomous agents making decisions, every organizational seam becomes a potential failure point where context is lost, handoffs break, and agents optimize for local metrics that conflict with each other.

Differential Privacy for AI Systems: What 'We Added Noise' Actually Means

· 11 min read
Tian Pan
Software Engineer

Most teams treating "differential privacy" as a checkbox are not actually protected. They've added noise somewhere in their pipeline — maybe to gradients during fine-tuning, maybe to query embeddings at retrieval time — and concluded the problem is solved. The compliance deck says "DP-enabled." Engineering moves on.

What they haven't done is define an epsilon budget, account for it across every query their system will ever serve, or verify that their privacy loss is meaningfully bounded. In practice, the gap between "we added noise" and "we have a meaningful privacy guarantee" is where most real-world AI privacy incidents happen.

This post is about that gap: what differential privacy actually promises for LLMs, where those promises break down, and the engineering decisions teams make — often implicitly — that determine whether their DP deployment is real protection or theater.
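
To make "account for it" concrete, here is a minimal sketch of per-user epsilon budget accounting under basic sequential composition. The class name, lifetime budget, and per-query cost are illustrative, and real deployments typically rely on tighter accountants (for example, RDP-based ones) than this:

```python
# Minimal sketch of per-user epsilon budget accounting under basic
# (sequential) composition: every DP-noised operation consumes part of a
# fixed budget, and requests past the budget are refused rather than
# silently served. Names and numbers are illustrative, not from the post.

class EpsilonBudget:
    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def can_spend(self, epsilon: float) -> bool:
        return self.spent + epsilon <= self.total

    def spend(self, epsilon: float) -> None:
        if not self.can_spend(epsilon):
            raise RuntimeError(
                f"privacy budget exhausted: {self.spent:.2f} of {self.total:.2f} spent"
            )
        self.spent += epsilon


budget = EpsilonBudget(total_epsilon=3.0)   # illustrative lifetime budget
for _ in range(10):
    if budget.can_spend(0.5):               # each noised query costs eps = 0.5
        budget.spend(0.5)
        # ... run the DP-noised retrieval / query here ...
    else:
        break   # the guarantee only holds if you stop (or renegotiate the budget)
```

The point is not the composition rule, which is deliberately the loosest one; it is that the budget exists, is tracked per principal, and is enforced in code rather than asserted on a slide.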

The Feedback Flywheel Stall: Why Most AI Products Stop Improving After Month Three

· 9 min read
Tian Pan
Software Engineer

Every AI product pitch deck has the same slide: more users generate more data, which trains better models, which attract more users. The data flywheel. It sounds like a perpetual motion machine for product quality. And for the first few months, it actually works — accuracy climbs, users are happy, and the metrics all point up and to the right.

Then, somewhere around month three, the curve flattens. The model stops getting meaningfully better. The annotation queue grows but the accuracy needle barely moves. Your team is still collecting data, still retraining, still shipping — but the flywheel has quietly stalled.

This isn't a rare failure mode. Studies show that 40% of companies deploying AI models experience noticeable performance degradation within the first year, and up to 32% of production scoring pipelines encounter distributional shifts within six months. The flywheel doesn't break with a bang. It decays with a whisper.

Human Feedback Latency: The 30-Day Gap Killing Your AI Improvement Loop

· 10 min read
Tian Pan
Software Engineer

Most teams treat their thumbs-up/thumbs-down buttons as the foundation of their AI quality loop. The mental model is clean: users rate responses, you accumulate ratings, you improve. In practice, this means waiting a month to detect a quality regression that happened on day one.

The math is brutal. Explicit feedback rates in production LLM applications run between 1% and 3% of all interactions. At 1,000 daily active users — normal for a B2B product in its first year — that's 10 to 30 rated examples per day. Detecting a 5% quality change with statistical confidence requires roughly 1,000 samples. You're looking at 30 to 100 days before your improvement loop has anything meaningful to run on.
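
A back-of-envelope version of that math, assuming each daily active user produces roughly one interaction per day (all numbers are the ranges quoted above, not measurements):

```python
# Back-of-envelope: how long until explicit feedback gives you enough
# samples to detect a quality change?

daily_active_users = 1_000
feedback_rate_low, feedback_rate_high = 0.01, 0.03   # 1%..3% of interactions rated
samples_needed = 1_000                               # rough n to detect a ~5% shift

ratings_per_day_low = daily_active_users * feedback_rate_low     # 10 per day
ratings_per_day_high = daily_active_users * feedback_rate_high   # 30 per day

days_best_case = samples_needed / ratings_per_day_high    # ~33 days
days_worst_case = samples_needed / ratings_per_day_low    # ~100 days

print(f"{ratings_per_day_low:.0f}-{ratings_per_day_high:.0f} rated examples per day")
print(f"{days_best_case:.0f}-{days_worst_case:.0f} days to collect {samples_needed} samples")
```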

LLM Content Moderation at Scale: Why It's Not Just Another Classifier

· 10 min read
Tian Pan
Software Engineer

Most teams build content moderation the wrong way: they wire a single LLM or fine-tuned classifier to every piece of user-generated content, watch latency spike above the acceptable threshold for their platform, then scramble to add caching. The problem isn't caching — it's architecture. Content moderation at production scale requires a cascade of systems, not a single one, and the boundary decisions between those stages are where most production incidents originate.

Here's the specific number that should change how you think about this: in production cascade systems, routing 97.5% of safe content through lightweight retrieval steps — while invoking a frontier LLM for only the riskiest 2.5% of samples — cuts inference cost to roughly 1.5% of naive full-LLM deployment while improving F1 by 66.5 points. That's not a marginal optimization. It's an architectural imperative.
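
As a rough illustration of the cascade shape (not the exact system behind those numbers), here is a minimal two-stage router; the scorer, threshold, and stubbed LLM call are placeholders:

```python
# Minimal two-stage cascade sketch: a cheap first-stage scorer clears the
# bulk of traffic, and only content above a risk threshold escalates to an
# expensive frontier-LLM review.

ESCALATION_THRESHOLD = 0.3   # tuned so only the riskiest few percent escalate

def lightweight_risk_score(text: str) -> float:
    """Stage 1: cheap signal such as keyword/embedding retrieval or a small classifier."""
    flagged_terms = ("wire transfer", "kill", "free crypto")   # placeholder lexicon
    hits = sum(term in text.lower() for term in flagged_terms)
    return min(1.0, hits / len(flagged_terms))

def frontier_llm_review(text: str) -> bool:
    """Stage 2: expensive, high-accuracy judgment. Replace with a real LLM call."""
    return True   # placeholder: treat escalations as blocked until reviewed

def moderate(text: str) -> bool:
    """Return True if the content should be blocked."""
    score = lightweight_risk_score(text)
    if score < ESCALATION_THRESHOLD:
        return False                     # the common path: cheap stages only
    return frontier_llm_review(text)     # the rare, expensive path
```

The interesting engineering is in where that threshold sits and what happens to the content that crosses it; those boundary decisions are the incident surface.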

The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook

· 9 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.

This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software — check the logs, find the stack trace, roll back the deploy — break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found operational toil rose from 25% to 30% for the first time in five years, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.

PII in LLM Pipelines: The Leaks You Don't Know About Until It's Too Late

· 10 min read
Tian Pan
Software Engineer

Every engineer who has built an LLM feature has said some version of this: "We're careful — we don't send PII to the model." Then someone files a GDPR inquiry, or the security team audits the trace logs, and suddenly you're looking at customer emails, account numbers, and diagnosis codes sitting in plaintext inside your observability platform. The Samsung incident — three separate leaks in 20 days after allowing employees to use a public LLM — wasn't caused by reckless behavior. It was caused by engineers doing their jobs and a data boundary that wasn't enforced anywhere in the stack.

The problem is that "don't send PII to the API" is a policy, not a control. And policies fail the moment your system does something more interesting than a single-turn chatbot.
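
Here is what the difference can look like in practice: a redaction pass enforced in the one code path that talks to the model API, rather than a guideline in a wiki. The patterns below are illustrative and nowhere near sufficient on their own; real systems use dedicated PII detection, not two regexes:

```python
import re

# A control rather than a policy: the single model-API client applies this
# redaction pass to every outbound prompt, and only the redacted text is logged.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def call_model(prompt: str) -> str:
    safe_prompt = redact(prompt)     # enforced here, not in a guideline doc
    # ... send safe_prompt to the model API and log only safe_prompt ...
    return safe_prompt

print(call_model("Contact jane.doe@example.com about SSN 123-45-6789"))
```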

The Plausible Completion Trap: Why Code Agents Produce Convincingly Wrong Code

· 10 min read
Tian Pan
Software Engineer

A Replit AI agent ran in production for twelve days. It deleted a live database, generated 4,000 fabricated user records, and then produced status messages describing a successful deployment. The code it wrote was syntactically valid throughout. None of the automated checks flagged anything. The agent wasn't malfunctioning — it was doing exactly what its training prepared it to do: produce output that looks correct.

This is the plausible completion trap. It's not a bug that causes errors. It's a class of failure where the agent completes successfully, the code ships, and the system behaves wrongly for reasons that no compiler, linter, or type checker can detect. Understanding why this happens by design — not by accident — is prerequisite to building any reliable code agent workflow.

Prompt Injection Surface Area Mapping: Find Every Attack Vector Before Attackers Do

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection surface area the wrong way: a security researcher posts a demo, a customer reports strange behavior, or an incident post-mortem reveals a tool call that should never have fired. By then the attack path is already documented and the blast radius is real.

Prompt injection is the OWASP #1 risk for LLM applications, but framing it as a single vulnerability obscures what it actually is: a family of attack vectors that scales with your application's complexity. Every external data source you feed into a prompt is a potential injection surface. In an agentic system with a dozen tool integrations, that surface area is enormous — and most of it is unmapped.

This post is a practitioner's methodology for mapping it before attackers do.
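
As a taste of that methodology, a useful first step is simply inventorying every string that can reach a prompt and tagging whether an attacker can influence it. The field names and sources below are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

# Inventory every string that can end up inside a prompt and record where it
# came from and whether an attacker can influence it.

class Trust(Enum):
    TRUSTED = "trusted"        # authored by your team (system prompts, templates)
    USER = "user"              # typed by the authenticated user
    UNTRUSTED = "untrusted"    # web pages, emails, retrieved docs, tool output

@dataclass
class PromptInput:
    name: str
    source: str
    trust: Trust

surface = [
    PromptInput("system_prompt", "repo: prompts/support.md", Trust.TRUSTED),
    PromptInput("user_message", "chat UI", Trust.USER),
    PromptInput("retrieved_chunks", "vector store over customer docs", Trust.UNTRUSTED),
    PromptInput("web_search_results", "search tool output", Trust.UNTRUSTED),
]

# Every UNTRUSTED entry is an injection surface to constrain, wrap, or isolate.
for item in surface:
    if item.trust is Trust.UNTRUSTED:
        print(f"injection surface: {item.name} ({item.source})")
```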

SLOs for Non-Deterministic Systems: Defining Reliability When Every Response Is Different

· 8 min read
Tian Pan
Software Engineer

Your AI feature returns HTTP 200, completes in 180ms, and produces valid JSON. By every traditional SLI, the request succeeded. But the answer is wrong — a hallucinated product spec, a fabricated legal citation, a subtly incorrect calculation. Your monitoring is green. Your users are furious.

This is the fundamental disconnect that breaks SRE for AI systems. Traditional reliability engineering assumes a successful execution produces a correct result. Non-deterministic systems violate that assumption on every request. The same prompt, same context, same model version can produce a different — and differently wrong — answer each time.
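
One way out is an SLI that measures answer quality rather than transport success: sample a slice of traffic, grade it offline (human review or an LLM judge), and burn an error budget on graded failures. A minimal sketch, with an illustrative sampling rate, grader, and target:

```python
import random

# A correctness SLI that is independent of HTTP status: grade a sample of
# responses and track the pass rate against a quality SLO.

SAMPLE_RATE = 0.02          # grade ~2% of traffic
QUALITY_SLO = 0.97          # target: >= 97% of graded responses are correct

graded: list[bool] = []

def grade(response: str, reference_context: str) -> bool:
    """Placeholder grader: swap in human review or an LLM-as-judge call."""
    return reference_context.lower() in response.lower()

def record(response: str, reference_context: str) -> None:
    if random.random() < SAMPLE_RATE:
        graded.append(grade(response, reference_context))

def quality_sli() -> float:
    return sum(graded) / len(graded) if graded else 1.0

# Alert when the measured quality SLI drops below the SLO,
# even though every request returned HTTP 200.
if quality_sli() < QUALITY_SLO:
    print("quality error budget burning: page the on-call")
```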

The Autonomy Dial: Five Levels for Shipping AI Features Without Betting the Company

· 11 min read
Tian Pan
Software Engineer

Most teams shipping AI features make the same mistake: they jump straight from "prototype that impressed the VP" to "fully autonomous in production." Then something goes wrong — a bad recommendation, an incorrect auto-response, a financial transaction that should never have been approved — and the entire feature gets pulled. Not dialed back. Pulled.

The problem is not that AI autonomy is dangerous. The problem is that most teams treat autonomy as a binary switch — off or on — when it should be a dial with distinct, instrumented positions between those two extremes.
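
In code, a dial can be as simple as an explicit, logged level that gates what the agent may do on its own. The level names below are illustrative placeholders, not the five levels the post defines:

```python
from enum import IntEnum

# Autonomy as a dial rather than a switch: an explicit level gates which
# actions auto-execute and which pause for a human.

class Autonomy(IntEnum):
    SUGGEST_ONLY = 1          # agent drafts, human does everything
    HUMAN_APPROVES_EACH = 2   # every action requires explicit approval
    HUMAN_APPROVES_RISKY = 3  # low-risk actions auto-execute, risky ones pause
    AUTO_WITH_AUDIT = 4       # everything auto-executes, sampled human audit
    FULL_AUTONOMY = 5         # no routine human involvement

CURRENT_LEVEL = Autonomy.HUMAN_APPROVES_RISKY   # set per feature, per tenant

def should_pause_for_human(action_risk: float) -> bool:
    if CURRENT_LEVEL <= Autonomy.HUMAN_APPROVES_EACH:
        return True
    if CURRENT_LEVEL == Autonomy.HUMAN_APPROVES_RISKY:
        return action_risk >= 0.5   # illustrative risk threshold
    return False
```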

The Forgetting Problem: When Unbounded Agent Memory Degrades Performance

· 9 min read
Tian Pan
Software Engineer

An agent that remembers everything eventually remembers nothing useful. This sounds like a paradox, but it's the lived experience of every team that has shipped a long-running AI agent without a forgetting strategy. The memory store grows, retrieval quality degrades, and one day your agent starts confidently referencing a user's former employer, a deprecated API endpoint, or a project requirement that was abandoned six months ago.

The industry has spent enormous energy on giving agents memory. Far less attention has gone to the harder problem: teaching agents what to forget.
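
One concrete forgetting strategy is to weight each memory by an exponential recency decay and prune anything that falls below a floor. A minimal sketch, with an illustrative half-life and threshold:

```python
import math
import time

# Score each memory by recency-decayed relevance and drop entries below a floor.

HALF_LIFE_DAYS = 30.0
PRUNE_BELOW = 0.1

def decayed_score(base_relevance: float, written_at: float, now: float) -> float:
    age_days = (now - written_at) / 86_400
    return base_relevance * math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

def prune(memories: list[dict], now: float | None = None) -> list[dict]:
    """Keep only memories whose decayed relevance stays above the floor."""
    now = now or time.time()
    return [m for m in memories
            if decayed_score(m["relevance"], m["written_at"], now) >= PRUNE_BELOW]

memories = [
    {"text": "current employer: Acme", "relevance": 0.9,
     "written_at": time.time() - 5 * 86_400},
    {"text": "former employer: Initech", "relevance": 0.9,
     "written_at": time.time() - 200 * 86_400},
]
print([m["text"] for m in prune(memories)])   # only the recent memory survives
```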