4 posts tagged with "safety"

The Explainability Trap: When AI Explanations Become a Liability

Tian Pan · Software Engineer · 11 min read

Somewhere between the first stakeholder demand for "explainable AI" and the moment your product team spec'd out a "Why did the AI decide this?" feature, a trap was set. The trap is this: your model does not know why it made that decision, and asking it to explain doesn't produce an explanation — it produces text that looks like an explanation.

This distinction matters enormously in production. Not because users deserve better philosophy, but because post-hoc AI explanations are driving real-world harm through regulatory non-compliance, misdirected user behavior, and safety monitors that can be fooled. Engineers shipping explanation features without understanding this will build systems that satisfy legal checkboxes while making outcomes worse.

The Self-Modifying Agent Horizon: When Your AI Can Rewrite Its Own Code

Tian Pan · Software Engineer · 10 min read

Three independent research teams, working across 2025 and into 2026, converged on the same architectural bet: agents that rewrite their own source code to improve at their jobs. One climbed from 17% to 53% on SWE-bench Verified without a human engineer changing a single line. Another doubled its benchmark score from 20% to 50% while also learning to remove its own hallucination-detection markers. A third started from nothing but a bash shell and now tops the SWE-bench leaderboard at 77.4%.

Self-modifying agents are no longer a theoretical curiosity. They are a research result you can reproduce today — and within a few years, a deployment decision your team will have to make.

Designing Approval Gates for Autonomous AI Agents

Tian Pan · Software Engineer · 10 min read

Most agent failures aren't explosions. They're quiet. The agent deletes the wrong records, emails a customer with stale information, or retries a payment that already succeeded — and you find out two days later from a support ticket. The root cause is almost always the same: the agent had write access to production systems with no checkpoint between "decide to act" and "act."

Approval gates are the engineering answer to this. Not the compliance checkbox version — a modal that nobody reads — but actual architectural interrupts that pause agent execution, serialize state, wait for a human decision, and resume cleanly. Done right, they let you deploy agents with real autonomy without betting your production data on every inference call.
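The pause/serialize/wait/resume cycle can be sketched in a few lines. This is a minimal illustration, not the post's implementation: the names (`PendingAction`, `ApprovalStore`, `execute_tool`) are hypothetical, and the in-memory dict stands in for whatever durable store a real system would use.

```python
import json
import uuid
from dataclasses import dataclass

@dataclass
class PendingAction:
    action_id: str
    tool: str
    args: dict
    agent_state: dict  # serialized context the agent needs to resume

class ApprovalStore:
    """In-memory stand-in for a durable queue (e.g. a database table)."""
    def __init__(self):
        self._pending = {}

    def submit(self, tool, args, agent_state):
        # The agent calls this instead of acting; its run ends here
        # and a human is notified out of band.
        action = PendingAction(str(uuid.uuid4()), tool, args, agent_state)
        self._pending[action.action_id] = action
        return action.action_id

    def resolve(self, action_id, approved):
        # Called later from the human-facing approval UI or API.
        action = self._pending.pop(action_id)
        if not approved:
            return {"status": "rejected", "tool": action.tool}
        # Approved: perform the gated side effect, then hand back the
        # serialized state so the agent can resume cleanly.
        result = execute_tool(action.tool, action.args)
        return {"status": "executed", "result": result,
                "resume_state": action.agent_state}

def execute_tool(tool, args):
    # Placeholder for the real side-effecting call (DB write, email, payment).
    return f"{tool}({json.dumps(args, sort_keys=True)})"

store = ApprovalStore()
aid = store.submit("refund_payment", {"order": 123, "amount": 50},
                   agent_state={"step": 4})
outcome = store.resolve(aid, approved=True)
```

The key design point this sketch captures: the agent never holds the write path itself. It emits an intent, dies or parks, and a separate resolution path applies the human decision, so a crash or a "no" between intent and execution leaves production untouched.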

LLM Guardrails in Production: What Actually Works

Tian Pan · Software Engineer · 8 min read

Most teams ship their first LLM feature, get burned by a bad output in production, and then bolt on a guardrail as damage control. The result is a brittle system that blocks legitimate requests, slows down responses, and still fails on the edge cases that matter. Guardrails are worth getting right — but the naive approach will hurt you in ways you don't expect.

Here's what the tradeoffs actually look like, and how to build a guardrail layer that doesn't quietly destroy your product.