Skip to main content

678 posts tagged with "ai-engineering"

View all tags

The AI Onboarding Gap: Why Engineers Can't Learn What They Can't Test

· 11 min read
Tian Pan
Software Engineer

A new engineer joins an AI-heavy team. On their third day, they see a prompt with an awkward double negation in the system instructions. It looks like a bug. They clean it up — the kind of small polish any reasonable person would do. Two hours later, customer-facing classification accuracy on a critical pipeline drops from 91% to 74%. Nobody has any idea why.

This scenario plays out in some form at almost every team building on LLMs. The new engineer isn't careless. The prompt did look wrong. But that double negation was load-bearing in a way that only the person who wrote it — after weeks of experimentation — actually understood. And they never wrote that understanding down.

This is the AI onboarding gap: the chasm between what an AI codebase appears to do and what it actually does, and why that gap is invisible until someone falls into it.

AI Pipeline Exception Handling: Hallucinations, Refusals, and Format Violations Are First-Class Errors

· 10 min read
Tian Pan
Software Engineer

Your AI pipeline reported zero errors last night. The output was completely wrong.

That's not a hypothetical. A recent industry report found that roughly 1 in 20 production LLM requests fail in ways that never surface as exceptions — valid HTTP 200, well-formed JSON, fluent prose, factually wrong. The observability stack stays green while the pipeline quietly lies to its users.

The root cause is an architectural assumption borrowed from traditional service engineering: that HTTP status codes and parse errors cover the failure space. They don't. LLM pipelines have at least four failure types that the underlying infrastructure cannot see — hallucinations, refusals, format violations, and context overflow — and treating them as edge cases instead of first-class error types is how production AI systems ship invisible bugs at scale.

Your AI Product's Dark Energy: The Background Compute Nobody Budgeted

· 10 min read
Tian Pan
Software Engineer

When your AI feature ships, you build a latency budget: how long does the model call take, how long does retrieval take, what's the p99 for the full request. What you almost certainly don't build is a budget for the inference that happens when no user is watching.

Every AI product with persistent state runs invisible work in the background. Documents get preprocessed when uploaded. Long conversations get re-summarized at session boundaries so the next session doesn't blow the context window. Proactive suggestions get generated on a schedule nobody set deliberately. Embeddings get regenerated when someone updates the schema. None of this shows up in your latency dashboard. Frequently it isn't in your cost model. Almost never is it in your monitoring.

This is your AI product's dark energy — the compute that explains the gap between what your inference bill should be and what it actually is.

Why Your Application Logs Can't Reconstruct an AI Decision

· 11 min read
Tian Pan
Software Engineer

An AI system flags a job application as low-priority. The candidate appeals. Legal asks engineering: "Show us exactly what the model saw, which documents it retrieved, which policy rules fired, and what confidence score it produced." Engineering opens the logs and finds: a timestamp, an HTTP 200, a response body, and a latency metric. The rest is gone.

This is not a logging failure. The logs are complete by every traditional measure. The problem is that application logs were never designed to record reasoning — and AI systems don't just execute code, they make context-dependent probabilistic decisions that can only be understood given the full input context that existed at decision time.

The Boring AI Manifesto: Why a Single Prompt Outperforms Your Autonomous Agent

· 9 min read
Tian Pan
Software Engineer

Here's an uncomfortable fact: 80% of AI projects fail to deliver business value, yet teams keep reaching for the most complex solution available. A multi-agent orchestration system with tool-calling, memory retrieval, and autonomous planning makes for a compelling demo. A single prompt that routes customer support tickets to the right queue makes your company $2M in the first year. These two outcomes are not equally likely, and they are not equally common, and the industry has been choosing the wrong one.

The pattern is predictable. An engineering team builds something impressive, demos it for leadership, gets approval to ship it — and then watches it silently degrade in production. Meanwhile, a competitor quietly deploys a two-hundred-line Python script wrapping a classifier, never demos it, and outperforms them on every business metric that matters.

Chunking for Agents vs. RAG: Why One Strategy Breaks Both

· 9 min read
Tian Pan
Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

Code Ownership Decay: What Happens to Team Knowledge When AI Writes Most Commits

· 9 min read
Tian Pan
Software Engineer

When a bug surfaces in production, the first ritual is the same: open git blame, find who wrote the line, ask them why. That ritual assumes the author had a reason — a constraint they knew, an edge case they handled deliberately, a business rule they'd internalized from three quarters of postmortems. For most of software history, git blame answered a question about intent.

Now, for a growing share of commits, git blame points to a human who merged the code and an AI that generated it. The human may have spent 90 seconds reading the diff. The AI had no context beyond the prompt. The "why" — the institutional knowledge that made git blame useful — was never written down anywhere.

This is code ownership decay. It doesn't announce itself. No single commit breaks the system. Instead, understanding slowly hollows out until the team reaches a decision point — a refactor, an incident, a new hire ramping up — and discovers that nobody can explain the system from the inside anymore.

The Compound Hallucination Problem: How Multi-Stage AI Pipelines Amplify Errors

· 10 min read
Tian Pan
Software Engineer

Most hallucination research focuses on what comes out of a single model call. That framing misses the scarier problem: what happens in a four-stage pipeline where each stage unconditionally trusts the previous output. A single hallucinated fact in Stage 1 doesn't just persist—it becomes the load-bearing premise for every subsequent inference. By Stage 4, the pipeline delivers a confident, internally coherent answer that happens to be entirely wrong.

This isn't a capability problem that better models will solve. It's a systems architecture problem, and it requires a systems-level fix.

Context Compression Artifacts: What Your Summarization Middleware Is Silently Losing

· 10 min read
Tian Pan
Software Engineer

Your agent said "Do NOT use eval()" at turn three. By turn thirty, it called eval(). Your insurance processor said "Never approve claims without valid ID." After fifteen compression cycles, it approved one. These aren't model failures — they're compression failures. The agent's reasoning was fine. The summarization middleware threw away the one constraint that mattered.

Context compression is now standard infrastructure in long-running agent systems. When conversation history grows too large for the context window, you compress it — roll up older turns into a summary, trim, chunk, or distill. The problem is that modern summarizers don't destroy information randomly. They destroy it predictably, along specific fault lines, and most teams only discover those fault lines in production.

The Context Window Is an API Surface: Treat Your Prompt Structure as a Contract

· 9 min read
Tian Pan
Software Engineer

Six months into a production LLM feature, an engineer files a bug: the model started giving incorrect output sometime last quarter. Nobody remembers changing the prompt. The git blame shows it was "cleaned up for readability." The previous version is gone. Debugging begins from scratch.

This is the moment teams discover that their context window was never really engineered — it was just assembled.

The context window is the contract between your system and the model. Every token that enters it — system instructions, retrieved documents, conversation history, tool schemas, the user query — is input to a function call that costs money, takes time, and produces non-deterministic output. Yet most teams treat context composition as an implementation detail rather than an API surface. Prompts get edited in place, without versioning. Sections grow by accumulation. Nobody owns the layout. Changes propagate silently. The debugging experience is worse than anything from the pre-LLM era, because at least stack traces tell you what changed.

The Data Flywheel Assumption: When AI Features Compound and When They Just Accumulate Noise

· 9 min read
Tian Pan
Software Engineer

Every AI pitch deck includes a slide about the data flywheel. The story is appealing: users interact with your AI feature, that interaction generates data, the data trains a better model, the better model attracts more users, and the cycle repeats. Scale long enough and you have an insurmountable competitive moat.

The problem is that most teams shipping AI features don't have a flywheel. They have a log file. A very large, expensive-to-store log file that has never improved their model and never will—because the three preconditions for a real flywheel are missing and nobody has asked whether they're present.

Data-Sensitivity-Tier Model Routing: Governing Which Model Sees Which Data

· 11 min read
Tian Pan
Software Engineer

Your AI system routed a patient query to a self-hosted model at 9 AM. At 11 AM, that model's pod restarted during a deployment. The request queue backed up, the router detected a timeout, and it fell back to the cloud LLM you use for generic queries. The query completed successfully. No alerts fired. Your monitoring dashboard showed green. Somewhere in that exchange, protected health information traveled to a vendor with whom you have no Business Associate Agreement.

That's not a hypothetical. It's the default behavior of nearly every AI routing stack that wasn't explicitly designed to prevent it.