678 posts tagged with "ai-engineering"

The Cross-User Consistency Problem: When Your AI Gives Different Answers to the Same Question

· 9 min read
Tian Pan
Software Engineer

Two analysts at the same company both ask your AI assistant: "What was our Q3 churn rate?" One gets 4.2%. The other gets 4.8%. Neither is wrong — they just queried at different times, in different session contexts, against a retrieval index that ranked slightly different chunks. The AI answered both confidently, without hedging, without flagging the discrepancy. The analysts go into the same meeting with different numbers and your tool has just become a liability.

This is the cross-user consistency problem, and it's one of the most common reasons enterprise AI deployments quietly lose trust. The failure isn't a hallucination in the classic sense — no facts were invented. The failure is that your system is non-deterministic at scale, and that non-determinism is invisible until two users compare notes.

The Domain Expert Bottleneck in RAG: Why Knowledge Curation Breaks Production AI

· 7 min read
Tian Pan
Software Engineer

Most teams building RAG systems spend their first month on the pipeline — chunking strategy, embedding model selection, vector store configuration, retrieval tuning. They get that working. The demo passes. Stakeholders are impressed.

Then six months later, the system starts quietly degrading. Support tickets reference wrong procedures. The bot cites a pricing tier that was retired in Q3. A customer gets a confident answer about a product feature that was deprecated before they even signed up. The pipeline is fine. The knowledge base is the problem.

Ensemble vs. Debate: The Two Multi-Model Verification Paradigms and When Each Fails

· 9 min read
Tian Pan
Software Engineer

When a single LLM gives you the wrong answer, the instinct is to ask more models. Run three in parallel and take the majority — that's ensemble. Or put them in a room and let them argue it out — that's debate. Both feel rigorous. Both have peer-reviewed results behind them. And both fail in exactly the same way when the conditions aren't right, which is the part practitioners rarely discuss.

The failure mode isn't subtle: when all your models learned from the same data, carry the same biases, or were trained by people with the same worldview, asking more of them doesn't give you more signal. It gives you more confident noise. Recent research has put a number on this: the pairwise error correlation between top frontier models sits around r = 0.77. Squaring that gives r² ≈ 0.59, which means roughly 60% of error variance is shared. Three models from different providers are effectively 1.3 independent models, not 3.0.
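To make the arithmetic concrete, here is a minimal sketch. The excerpt doesn't say how the 1.3 figure was derived, so the formula below is an assumption: it is the standard effective-sample-size approximation for equally correlated estimators, n / (1 + (n - 1) * rho). The function names are hypothetical.

```python
def shared_variance(r: float) -> float:
    """Fraction of error variance two models share, given pairwise correlation r."""
    return r ** 2


def effective_models(n: int, rho: float) -> float:
    """Effective number of independent models among n equally correlated ones.

    Assumption: uses the textbook design-effect formula n / (1 + (n - 1) * rho).
    The post may use a different definition.
    """
    return n / (1 + (n - 1) * rho)


r = 0.77
print(f"shared variance: {shared_variance(r):.2f}")
# Plugging in r directly, and plugging in the shared variance r**2 instead.
print(f"effective models (rho = r):    {effective_models(3, r):.2f}")
print(f"effective models (rho = r**2): {effective_models(3, shared_variance(r)):.2f}")
```

Depending on which correlation measure you plug in, the result lands between roughly 1.2 and 1.4, the same ballpark as the figure quoted above. Either way, it is far closer to one model than to three.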

Explanation Debt: Why Users Deserve to Know What Your AI Did

· 8 min read
Tian Pan
Software Engineer

A loan application gets rejected. A candidate gets filtered out of a hiring pipeline. A medical imaging tool flags a scan as abnormal. In each case, an AI system made a decision that matters—and the user has no idea why.

Teams building these systems often spent months tuning precision, recall, and output quality. They ran A/B tests, iterated on prompts, and shipped a model that gets the right answer 94% of the time. But they never built the layer that tells users what happened. This is explanation debt: the accumulated cost of shipping AI decisions without the attribution, confidence signals, and recourse affordances that make those decisions interpretable.

Gradual Context Replacement: Managing Long AI Conversations Without Losing Quality

· 9 min read
Tian Pan
Software Engineer

Your chatbot works perfectly for the first fifteen turns. Then something goes wrong. It contradicts an earlier decision. It asks for information the user already provided. It loses the thread of a multi-step task that was clearly defined at the start. The conversation history is technically all there (you haven't deleted anything), but the model is behaving as if it weren't.

This is context rot: the gradual degradation of output quality as conversation histories grow. A 2024 evaluation of 18 state-of-the-art models across nearly 200,000 controlled calls found that reliability decreases significantly beyond 30,000 tokens, even in models with much larger nominal windows. High-performing models become as unreliable as much smaller ones in extended dialogues. The problem isn't that your context window ran out. It's that transformer attention is quadratic—100,000 tokens means 10 billion pairwise relationships—and the model is forced to distribute focus so thinly that important earlier content gets effectively ignored.

When teams hit this wall, they usually reach for one of two fixes: truncation or summarization. Both make things worse in predictable ways.
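The quadratic-attention arithmetic above is easy to check. This is a rough sketch that counts every query-key pair as n * n. It ignores causal masking, multiple heads, and layers, and the function name is hypothetical.

```python
def pairwise_relationships(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


for n in (30_000, 100_000):
    print(f"{n:>7,} tokens -> {pairwise_relationships(n):,} pairs")
```

Going from 30,000 to 100,000 tokens multiplies the context length by about 3.3 but the number of pairs by about 11, which is why growth in the window and growth in the attention workload feel so different.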

HIPAA, SOC2, and Your Agent: The Architectural Constraints Compliance Actually Imposes

· 12 min read
Tian Pan
Software Engineer

The typical AI team's encounter with compliance goes like this: the agent is in production, users love it, and someone from legal forwards an email asking whether the system is HIPAA-compliant. The engineer assigned to answer discovers that context windows contain PHI, that there are no audit logs with sufficient granularity, that the LLM provider doesn't have a signed Business Associate Agreement, and that the agent's tool permissions are broader than the minimum necessary standard allows. The fix takes three months and requires a partial rewrite.

This pattern is not an edge case. According to recent industry surveys, 78% of business executives cannot pass an AI governance audit within 90 days, and 42% of companies abandoned AI initiatives in 2025 primarily due to compliance and governance failures, not technical ones. The gap between what gets built and what compliance actually requires is architectural, and it forms in sprint one.

Human Override as a First-Class Feature: Designing AI Systems That Fail Gracefully to Human Control

· 10 min read
Tian Pan
Software Engineer

When an AI-powered customer support agent can't resolve an issue and escalates to a human, what happens next? In most systems: the customer is transferred cold, with no context, and must re-explain everything from the beginning. The human agent has no idea what the AI attempted, what information was collected, or why the handoff occurred.

This is the most common form of human override failure — not a dramatic AI meltdown, but a quiet UX collapse at the seam between automated and human handling. It happens because engineers built the AI path carefully and treated human takeover as an afterthought, a fallback for when things go wrong. The result is that override feels like a system error rather than a designed operational mode.

The engineering teams that get this right treat human override as a first-class feature from day one. Here's what that looks like in practice.
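As one illustration of the gap described above, a handoff can carry a structured record of what the AI attempted, what it collected, and why it escalated. This is a minimal sketch, not a description of any particular product: `HandoffContext` and its fields are hypothetical names chosen to mirror the three missing pieces of context.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffContext:
    """Hypothetical record passed from the AI agent to the human agent."""
    escalation_reason: str  # why the handoff occurred
    attempted_actions: list[str] = field(default_factory=list)  # what the AI tried
    collected_info: dict[str, str] = field(default_factory=dict)  # what the customer already said

    def summary(self) -> str:
        """Render a short briefing the human agent can read before replying."""
        lines = [f"Escalation reason: {self.escalation_reason}", "AI attempted:"]
        if self.attempted_actions:
            lines += [f"  - {action}" for action in self.attempted_actions]
        else:
            lines.append("  (nothing recorded)")
        lines.append("Already collected:")
        if self.collected_info:
            lines += [f"  {key}: {value}" for key, value in self.collected_info.items()]
        else:
            lines.append("  (nothing recorded)")
        return "\n".join(lines)


ctx = HandoffContext(
    escalation_reason="refund exceeds AI approval limit",
    attempted_actions=["looked up order", "offered store credit"],
    collected_info={"order_id": "A123"},
)
print(ctx.summary())
```

The point of the sketch is the shape rather than the fields: if the record is built as the conversation proceeds, the customer never has to re-explain, and the override reads as a designed mode rather than an error path.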

The Invisible Author Problem: Git Blame When AI Writes Most of Your Code

· 8 min read
Tian Pan
Software Engineer

When something breaks in production, the first thing engineers reach for is git blame. The commit hash links to a PR. The PR links to an author. The author links to context — a Slack thread, a design doc, a brain that remembers why the code was written that way. This chain is how teams debug incidents, conduct security audits, and accumulate institutional knowledge. It assumes that every line of code has a human author who understood what they were doing.

AI has quietly broken that assumption. Roughly 46% of code is now AI-generated, with Java shops pushing that figure past 60%. Most of that code carries no meaningful provenance metadata. The git blame chain still runs — it just now terminates at a developer who accepted a suggestion they may not have fully understood, with no record of the prompt, the model version, or the alternatives the AI rejected.

LLM-as-Judge Adversarial Failures: When Your Eval Harness Gets Gamed

· 9 min read
Tian Pan
Software Engineer

Your LLM-as-judge gave your new model a clean bill of health. Win rates are up, rubric scores improved across the board, and the automated eval pipeline ran green. Then you shipped — and user satisfaction dropped.

This is not an edge case. Researchers built constant-output "null models" that produce the exact same response regardless of input and gamed AlpacaEval 2.0 to an 86.5% length-controlled win rate. The verified state of the art at the time was 57.5%. When a model with no task capability at all can top your leaderboard, your eval harness has a problem that's worth understanding systematically.

PII in the Prompt: The Data Minimization Patterns Your AI Pipeline Is Missing

· 12 min read
Tian Pan
Software Engineer

Research from 2025 found that 8.5% of prompts submitted to commercial LLMs contain sensitive information — PII, credentials, and internal file references. That statistic probably undersells the problem. It counts what users explicitly type. It doesn't count what your system silently adds: retrieved customer records, tool outputs from database queries, memories persisted from previous sessions, or fine-tuning data that wasn't scrubbed before training. Most AI pipelines leak PII not through user mistakes but through architectural blind spots that no single engineer owns.

The failure mode is almost always the same: a team ships an AI feature thinking "we don't send personal data," but personal data enters through the seams — in the RAG retrieval chunk that includes a customer's address, in the agent tool output that returns a full user profile, in the fine-tuning dataset that was exported from a CRM without redaction. GDPR's data minimization principle requires that you collect only what's necessary for a specific purpose. LLM architectures violate this by default.
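One way to picture closing those seams is a redaction pass over retrieved chunks and tool outputs before they reach the prompt. The sketch below is illustrative only: the function name and placeholder tokens are hypothetical, and the two regular expressions cover just simple email and US-style phone formats. Real pipelines would need far broader detection (names, street addresses, account numbers).

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt from redacted chunks, so raw retrieval text never reaches it."""
    context = "\n".join(redact(chunk) for chunk in retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


chunks = ["Contact jane@example.com or 555-123-4567 about the renewal."]
print(build_prompt("When does the contract renew?", chunks))
```

Putting the redaction inside the prompt builder, rather than relying on each caller to remember it, is the design choice that matters here: the minimization step is owned by one function instead of by nobody.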

Privacy Mode That Actually Keeps Its Promise: Engineering User-Controlled Data Boundaries in AI Features

· 10 min read
Tian Pan
Software Engineer

In March 2026, a class action lawsuit alleged that Perplexity's "Incognito Mode" was routing conversational data and user identifiers to Meta and Google's ad networks — even for paying subscribers who had explicitly activated it. The feature was called incognito. Users assumed that meant private. The implementation said otherwise.

This is the most common failure mode in AI privacy modes: the name is marketing, the implementation is retention theater. Engineers ship a toggle. Legal approves the wording. Users flip the switch and trust it. And somewhere in the data pipeline, inputs are still flowing to a logging service, a training job, or a third-party analytics SDK that nobody remembered to gate.

Prompt Injection in Multimodal Inputs: The Visual Attack Surface Your Text-Only Defense Misses

· 11 min read
Tian Pan
Software Engineer

When teams harden their AI pipelines against prompt injection, they usually focus on text: sanitizing user input strings, scanning outputs for exfiltrated data, filtering known jailbreak patterns. That work matters, but it addresses roughly half the attack surface of a modern AI system. The other half lives inside images, PDFs, audio clips, and charts — formats that bypass every text-scanning rule you've written, because the model processes them through entirely different pathways than it processes text.

Steganographic injection attacks against vision-language models achieve success rates around 24% across production models including GPT-4V, Claude, and LLaVA. That number isn't a lab artifact. It measures real attack payloads, hidden in ordinary-looking images, causing production models to deviate from their intended behavior. Your text injection scanner doesn't see any of it.