861 posts tagged with "insider"

Closing the Feedback Loop: How Production AI Systems Actually Improve

· 12 min read
Tian Pan
Software Engineer

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.

Context Poisoning in Long-Running AI Agents

· 9 min read
Tian Pan
Software Engineer

Your agent completes step three of a twelve-step workflow and confidently reports that the target API returned a 200 status. It didn't — that result was from step one, still sitting in the context window. By step nine, the agent has made four downstream calls based on a fact that was never true. The workflow "succeeds." No error is logged.

This is context poisoning: not a security attack, but a reliability failure mode where the agent's own accumulated context becomes a source of wrong information. As agents run longer, interact with more tools, and manage more state, the probability of this failure climbs sharply. And unlike crashes or exceptions, context poisoning is invisible to standard monitoring.
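One way to make this failure visible is to tag every tool result with the step that produced it and refuse to reuse results from other steps. The sketch below is a minimal illustration under that assumption; `ToolResult`, `result_for_step`, and the tool name `target_api` are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolResult:
    step: int    # workflow step that produced this result
    tool: str    # hypothetical tool name, e.g. "target_api"
    status: int  # HTTP-style status code


def result_for_step(history: list[ToolResult], tool: str, step: int) -> Optional[ToolResult]:
    """Return the result the tool produced at exactly this step, or None."""
    for result in reversed(history):
        if result.tool == tool and result.step == step:
            return result
    return None  # nothing fresh: the caller must re-run the tool, not reuse old data


# Step one returned a 200; step three has not called the API yet.
history = [ToolResult(step=1, tool="target_api", status=200)]
fresh = result_for_step(history, "target_api", step=3)
```

Here `fresh` is `None`, so the step-one result cannot be mistaken for a step-three result.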

The HITL Rubber Stamp Problem: Why Human-in-the-Loop Often Means Neither

· 9 min read
Tian Pan
Software Engineer

There's a paradox sitting at the center of responsible AI deployment: the more you try to involve humans in reviewing AI decisions, the less meaningful that review becomes.

A 2024 Harvard Business School study gave 228 evaluators AI recommendations with clear explanations of the AI's reasoning. Those evaluators were 19 percentage points more likely to align with the AI's recommendations than the control group. When the AI also provided narrative rationales, explaining why it made a decision, deference increased by another 5 points. Better explainability produced worse oversight. The human in the loop had become a rubber stamp on a form.

The Hybrid Automation Stack: A Decision Framework for Mixing Rules and LLMs

· 9 min read
Tian Pan
Software Engineer

Teams that replace all their Zapier flows and RPA scripts with LLM agents tend to discover the same thing six months later: they've traded brittle-but-auditable for flexible-but-unmaintainable. The Zapier flows broke in predictable ways—step 14 failed because the API changed. The LLM workflows break invisibly—the model quietly routes support tickets to the wrong queue, and nobody finds out until a customer escalates. The audit log says "AI decision," which is lawyer-speak for "no one knows."

The answer isn't to avoid LLMs in automation. It's to be deliberate about which tasks go to which system, and to architect the seam between them so failures don't cross over.
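A minimal sketch of such a seam, assuming a ticket-routing task: deterministic rules run first, the LLM handles only what the rules cannot, and its output is validated before it is trusted. The queue names, the keyword rule, and the `llm_classify` callable are all hypothetical.

```python
ALLOWED_QUEUES = {"billing", "technical", "human_review"}


def route_ticket(text, llm_classify):
    """Route a ticket and record which system made the decision."""
    lowered = text.lower()
    # Deterministic path: auditable, and it fails in predictable ways.
    if "refund" in lowered or "invoice" in lowered:
        return {"queue": "billing", "decided_by": "rule:billing_keywords"}
    # LLM path: any error or out-of-range answer stops at the seam.
    try:
        guess = llm_classify(text)
    except Exception as exc:
        return {"queue": "human_review",
                "decided_by": f"fallback:llm_error:{type(exc).__name__}"}
    if guess not in ALLOWED_QUEUES:
        return {"queue": "human_review", "decided_by": "fallback:invalid_llm_output"}
    return {"queue": guess, "decided_by": "llm"}
```

The `decided_by` field is the point of the design: the audit log names the rule, the model, or the fallback, rather than a bare "AI decision."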

Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic

· 11 min read
Tian Pan
Software Engineer

Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.

This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.
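To see why an average hides the tail, here is a small sketch that computes nearest-rank percentiles over a made-up latency sample. The numbers are illustrative, not measurements, and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 94 fast requests and 6 slow ones.
latencies_ms = [200] * 94 + [4200] * 4 + [11000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 576.0
p50_ms = percentile(latencies_ms, 50)            # 200
p95_ms = percentile(latencies_ms, 95)            # 4200
p99_ms = percentile(latencies_ms, 99)            # 11000
```

The mean and median both look healthy here, while the p95 and p99 sit far above a one-second target. An SLO written against the mean would never notice.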

The Provider Reliability Trap: Your LLM Vendor's SLA Is Now Your Users' SLA

· 9 min read
Tian Pan
Software Engineer

In December 2024, Zendesk published a formal incident report stating that from June 10 through June 11, 2025, customers lost access to all Zendesk AI features for more than 33 consecutive hours. The engineering team's remediation steps were empty — there was nothing to do. The outage was caused entirely by their upstream LLM provider going down, and Zendesk had no architectural path to restore service without it.

This is the provider reliability trap in its clearest form: you ship a feature, make it part of your users' workflows, promise availability through implicit or explicit SLA commitments, and then discover that your entire reliability posture is bounded by a dependency you don't control, can't fix, and may not have formally evaluated before launch.
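One architectural path out of the trap is an ordered fallback chain with an explicit degraded mode. The sketch below assumes each provider is wrapped in a plain callable; `complete_with_fallback`, `primary`, and `secondary` are hypothetical stand-ins, not real client libraries.

```python
def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order; return a degraded response if all fail."""
    errors = []
    for name, call in providers:
        try:
            return {"provider": name, "text": call(prompt), "degraded": False}
        except Exception as exc:  # broad on purpose: any provider failure triggers fallback
            errors.append(f"{name}: {exc}")
    return {
        "provider": None,
        "text": "AI features are temporarily unavailable.",
        "degraded": True,
        "errors": errors,
    }


def primary(prompt):
    # Stand-in for an upstream provider that is currently down.
    raise TimeoutError("upstream unavailable")


def secondary(prompt):
    # Stand-in for a second provider or a smaller self-hosted model.
    return f"echo: {prompt}"


result = complete_with_fallback("hi", [("primary", primary), ("secondary", secondary)])
```

In this run the primary fails and the secondary answers. If every provider fails, the caller gets a degraded response it can render instead of an outage.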

LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

· 9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly swapped out regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines and replaced them with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories — without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.

The Multi-Tenant Prompt Problem: When One System Prompt Serves Many Masters

· 9 min read
Tian Pan
Software Engineer

You ship a new platform-level guardrail — a rule that prevents the AI from discussing competitor pricing. It goes live Monday morning. By Wednesday, your largest enterprise customer files a support ticket: their sales assistant, which they'd carefully tuned to compare vendor options for their procurement team, stopped working. They didn't change anything. You changed something, and the blast radius hit them invisibly.

This is the multi-tenant prompt problem. B2B AI products that allow customer customization are actually running a layered instruction system, and most teams don't treat it like one. They treat it like string concatenation: take the platform prompt, append the customer's instructions, maybe append user preferences, and call the LLM. The model figures out the rest.

The model doesn't figure it out. It silently picks a winner, and you don't find out which one until someone complains.
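Treating the prompt as a layered system means resolving conflicts before the model sees them. The following sketch assumes instructions can be tagged with a layer and a topic, and that the platform layer outranks the tenant layer, which outranks the user layer. `Rule`, `resolve`, and the topic key are hypothetical.

```python
from dataclasses import dataclass

# Layers from highest to lowest precedence (an assumed policy, not a standard).
PRECEDENCE = ["platform", "tenant", "user"]


@dataclass(frozen=True)
class Rule:
    layer: str   # "platform", "tenant", or "user"
    topic: str   # hypothetical topic key, e.g. "competitor_pricing"
    allow: bool  # whether the behavior is permitted
    text: str    # the instruction as written


def resolve(rules):
    """Return (winning rule per topic, list of (winner, overridden) conflicts)."""
    winners, conflicts = {}, []
    for rule in sorted(rules, key=lambda r: PRECEDENCE.index(r.layer)):
        current = winners.get(rule.topic)
        if current is None:
            winners[rule.topic] = rule
        elif current.allow != rule.allow:
            conflicts.append((current, rule))  # surfaced, not silently dropped
    return winners, conflicts


rules = [
    Rule("tenant", "competitor_pricing", True, "Compare vendor pricing options."),
    Rule("platform", "competitor_pricing", False, "Do not discuss competitor pricing."),
]
winners, conflicts = resolve(rules)
```

The `conflicts` list is what string concatenation never gives you: a record that a platform rule overrode a tenant instruction, available before rollout rather than after a support ticket.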

Prompt Linting: The Pre-Deployment Gate Your AI System Is Missing

· 8 min read
Tian Pan
Software Engineer

Every serious engineering team runs a linter before merging code. ESLint catches undefined variables. Prettier enforces formatting. Semgrep flags security anti-patterns. Nobody ships JavaScript to production without running at least one static check first.

Now consider what your team does before shipping a prompt change. If you're like most teams, the answer is: review it in a PR, eyeball it, maybe test it manually against a few inputs. Then merge. The system prompt for your production AI feature — the instruction set that controls how the model behaves for every single user — gets less pre-deployment scrutiny than a CSS change.

This gap is not a minor process oversight. A study analyzing over 2,000 developer prompts found that more than 10% contained vulnerabilities to prompt injection attacks, and roughly 4% had measurable bias issues — all without anyone noticing before deployment. The tooling to catch these automatically exists. Most teams just haven't wired it in yet.
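As a rough illustration of what a static prompt check can look like, here is a minimal sketch. It assumes `{{name}}` template placeholders and a convention of wrapping user input in `<user_input>` tags; both conventions, the rule names, and `lint_prompt` itself are hypothetical. It is not a substitute for dedicated tooling.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
WRAPPED_INPUT = re.compile(r"<user_input>\s*\{\{\s*user_input\s*\}\}\s*</user_input>")


def lint_prompt(prompt, declared_vars, max_chars=8000):
    """Return a list of finding codes for a prompt template."""
    findings = []
    used = set(PLACEHOLDER.findall(prompt))
    # Placeholders the calling code never fills in.
    for name in sorted(used - set(declared_vars)):
        findings.append(f"undeclared-variable:{name}")
    # Notes left behind by whoever edited the prompt last.
    if re.search(r"\b(TODO|FIXME)\b", prompt):
        findings.append("leftover-marker")
    # Heuristic injection check: user input should sit inside explicit delimiters.
    if "user_input" in used and not WRAPPED_INPUT.search(prompt):
        findings.append("undelimited-user-input")
    if len(prompt) > max_chars:
        findings.append("too-long")
    return findings
```

Run as a CI step, a non-empty result fails the build in the same way a linter error would.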

Schema Entropy: Why Your Tool Definitions Are Rotting in Production

· 10 min read
Tian Pan
Software Engineer

Your agent was working fine in January. By March, it started failing on 15% of tool calls. By May, it was silently producing wrong outputs on another 20%. Nothing in your deployment logs changed. No one touched the agent code. The tool definitions look exactly like they did six months ago — and that's the problem.

Tool schemas don't have to be edited to become wrong. The services they describe change underneath them. Enum values get added. Required fields become optional in a backend refactor. A parameter that used to accept strings now expects an ISO 8601 timestamp. The schema document stays frozen while the underlying API keeps moving, and your agent keeps calling it confidently, with no idea the contract has shifted.

This is schema entropy: the gradual divergence between the tool definitions your agent was given and the tool behavior your production services actually exhibit. It is one of the most underappreciated reliability problems in production AI systems, and research suggests tool versioning issues account for roughly 60% of production agent failures.
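A drift check can be as simple as diffing the frozen tool definition against the schema the live service currently publishes. The sketch below assumes both are available as JSON-Schema-style dictionaries; `diff_tool_schema` and the example fields are hypothetical.

```python
def diff_tool_schema(frozen, live):
    """List differences between a frozen tool schema and the live one."""
    drift = []
    f_props = frozen.get("properties", {})
    l_props = live.get("properties", {})
    for name in sorted(set(f_props) | set(l_props)):
        if name not in l_props:
            drift.append(f"{name}: removed from live API")
        elif name not in f_props:
            drift.append(f"{name}: added in live API")
        else:
            f, l = f_props[name], l_props[name]
            if f.get("type") != l.get("type") or f.get("format") != l.get("format"):
                drift.append(f"{name}: type/format changed")
            if set(f.get("enum", [])) != set(l.get("enum", [])):
                drift.append(f"{name}: enum values changed")
    f_req = set(frozen.get("required", []))
    l_req = set(live.get("required", []))
    for name in sorted(f_req ^ l_req):
        drift.append(f"{name}: required flag changed")
    return drift


frozen = {
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed"]},
        "since": {"type": "string"},
    },
    "required": ["status", "since"],
}
live = {
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed", "pending"]},
        "since": {"type": "string", "format": "date-time"},
    },
    "required": ["status"],
}
drift = diff_tool_schema(frozen, live)
```

This example reproduces the three kinds of change described above: a new enum value, a required field that became optional, and a string parameter that now expects a timestamp format.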

The Selective Abstention Problem: Why AI Systems That Always Answer Are Broken

· 10 min read
Tian Pan
Software Engineer

Here is a pattern that appears in almost every production AI deployment: the team ships a feature that handles 90% of queries well. Then they start getting complaints. A user asked something outside the training distribution; the model confidently produced a wrong answer. A RAG pipeline retrieved a stale document; the model answered as though it were current. A legal query hit an edge case the prompt didn't cover; the model speculated its way through it. The fix, in each case, wasn't a better model. It was teaching the system to say "I don't know."

Abstention — the principled decision to not answer — is one of the hardest and most undervalued capabilities in AI system design. Virtually all product effort goes toward making answers better. Almost none goes toward making the system reliably know when to withhold one. That asymmetry is a design debt that compounds in production.
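A minimal sketch of an abstention gate for a RAG-style system follows, assuming each retrieved document carries a similarity score and a last-updated date. The thresholds, the `Evidence` type, and `abstain_reason` are hypothetical and would need tuning against real data.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Evidence:
    retrieval_score: float  # hypothetical similarity score in [0, 1]
    last_updated: date      # when the source document was last revised


def abstain_reason(evidence, today, min_score=0.75, max_age_days=180):
    """Return a reason to withhold an answer, or None if answering is acceptable."""
    if not evidence:
        return "no supporting documents"
    best = max(evidence, key=lambda e: e.retrieval_score)
    if best.retrieval_score < min_score:
        return "low retrieval confidence"
    if today - best.last_updated > timedelta(days=max_age_days):
        return "stale source"
    return None
```

Returning a reason rather than a boolean lets the product say why it is declining, and lets you count abstentions by cause.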

The Semantic Validation Layer: Why JSON Schema Isn't Enough for Production LLM Outputs

· 10 min read
Tian Pan
Software Engineer

By 2025, every major LLM provider had shipped constrained decoding for structured outputs. OpenAI, Anthropic, Gemini, Mistral — they all let you hand the model a JSON schema and guarantee it comes back structurally intact. Teams adopted this and breathed a collective sigh of relief. Parsing errors disappeared. Retry loops shrank. Dashboards turned green.

Then the subtle failures started.

A sentiment classifier locked in at 0.99 confidence on every input — gibberish included — for two weeks before anyone noticed. A credit risk agent returned valid JSON approving a loan application that should have been declined, with a risk score fifty points too high. A financial pipeline coerced "$500,000" (a string, technically schema-valid) down to zero in an integer field, corrupting six weeks of risk calculations. Every one of these failures passed schema validation cleanly.

The lesson: structural validity is necessary, not sufficient. You need a semantic validation layer, and most teams don't have one.
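The sketch below shows what a few semantic checks might look like for failures of the kind described above. The field names, the risk threshold, and both function names are hypothetical; real checks would come from the domain's own business rules.

```python
import statistics


def check_loan_record(record, max_approvable_risk=650):
    """Return semantic problems in a schema-valid loan decision."""
    problems = []
    amount = record.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        problems.append("amount must be an integer, not a value to coerce")
    elif amount <= 0:
        problems.append("amount must be positive")
    # Cross-field rule: an approval must be consistent with the risk score.
    if record.get("decision") == "approve" and record.get("risk_score", 0) > max_approvable_risk:
        problems.append("approved despite risk score above threshold")
    return problems


def confidence_looks_stuck(confidences, min_spread=0.01):
    """Flag a batch whose confidence scores barely vary across inputs."""
    return len(confidences) > 1 and statistics.pstdev(confidences) < min_spread
```

The first check works on single records. The second works on a batch, which is the only level at which a classifier pinned at 0.99 becomes visible.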