40 posts tagged with "governance"

The Compliance Audit That Asked Which Model Produced Which Output

· 10 min read
Tian Pan
Software Engineer

The auditor's question sounds simple. She has your appeals log open, points at a row from eight months ago, and asks which model decided that case. Your engineer pulls up the schema: there is a model column, and every decision in the audit window says v1. Then someone from the platform team mentions, almost in passing, that the alias behind v1 rotated four times during the audit period: a base model upgrade, a fine-tune refresh, a vendor-side capacity move, and one rollback that lasted six hours during an incident. The honest answer is that you cannot say which checkpoint produced that decision. The auditor writes something down. "We cannot say" is not a regulator-acceptable answer, and you have just learned that the system you shipped has been failing an audit requirement it was never designed to meet.

The gap here is not a missing log line. The gap is between two different ideas of what "model" means. To the engineers shipping the system, v1 is an endpoint — a stable contract callers can point at while the thing behind it gets upgraded for free. To the auditor, "the model that produced this decision" is a specific artifact: a weight checkpoint, a hash, a thing you could in principle re-run on the same input and get a defensibly similar output. Endpoint aliases were invented to hide checkpoint rotation from callers. Audit-grade provenance demands the opposite — that every decision be attributable to exactly the checkpoint that produced it. The two ideas were on a collision course from the start; the audit just happened to be where they met.
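One way to picture audit-grade provenance is a decision row that keeps the alias but also pins the checkpoint the alias resolved to at decision time. The sketch below is illustrative only: `AliasBinding`, `resolve_checkpoint`, `decision_record`, and the field names are hypothetical, and it assumes the platform team keeps a history of which checkpoint hash each alias pointed at and when.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AliasBinding:
    """One period during which an alias pointed at a specific checkpoint."""
    alias: str
    checkpoint_sha256: str            # hypothetical: hash of the weight artifact
    effective_from: datetime          # inclusive
    effective_to: Optional[datetime]  # exclusive; None means still active


def resolve_checkpoint(bindings: List[AliasBinding], alias: str, at: datetime) -> str:
    """Return the checkpoint hash the alias pointed to at time `at`."""
    for b in bindings:
        if b.alias != alias:
            continue
        if b.effective_from <= at and (b.effective_to is None or at < b.effective_to):
            return b.checkpoint_sha256
    raise LookupError(f"no binding for alias {alias!r} at {at.isoformat()}")


def decision_record(case_id: str, alias: str,
                    bindings: List[AliasBinding], at: datetime) -> dict:
    """Build the row to persist: keep the alias, but pin the checkpoint too."""
    return {
        "case_id": case_id,
        "model_alias": alias,
        "model_checkpoint": resolve_checkpoint(bindings, alias, at),
        "decided_at": at.isoformat(),
    }
```

Writing the resolved checkpoint at decision time is likely safer than reconstructing it later from rotation history, since a six-hour rollback is exactly the kind of event that history tends to miss.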

The Process Your Agent Quietly Owns Without Documentation

· 10 min read
Tian Pan
Software Engineer

Six months ago, your team shipped a support agent that handles refunds. There was a one-page Notion doc describing what it should do. Today, the doc still says what it said, but the agent does not. The prompt has 47 edits in its history. Three tools were added — one of them quietly bypasses a finance check that the doc still asserts exists. The model was swapped twice. A retry policy was hardened after an incident nobody wrote up. And when somebody on the data team asks "what are the actual rules for issuing a refund here," the honest answer is: read the system prompt and the tool registry, because that is the spec now.

This is the quiet failure mode of agentic systems in production: the agent's behavior IS the runbook nobody wrote. The prompt got treated as a configuration value — a string in a YAML file, edited by whoever owned the feature, reviewed like a copy change — when it was actually the most authoritative description of a multi-step business process in the company. The org accumulated process logic the way legacy codebases accumulate behavior: through edits, not design. And the people who would historically own that process — a product manager, a compliance lead, an ops director — never realized they had lost the artifact, because there was never a document to lose.

The Recurring Task Your Agent Scheduled With Nobody To Inherit

· 9 min read
Tian Pan
Software Engineer

A user types "remind me every Tuesday to check that integration." The agent creates a cron entry, returns a polite confirmation, and the session closes. Six months later the user has changed teams. The integration was deprecated last quarter. The cron is still firing, hitting an API key that was rotated in April, into a Slack channel that was archived in May, charged to a project budget that nobody reviews. The agent did exactly what was asked. The asking is what aged badly.

This is not a bug in any particular agent. It is the shape of a category. The moment we gave agents the ability to schedule durable side effects — cron jobs, webhooks, polling loops, workflow triggers, calendar invites, recurring queries — we created a class of infrastructure that is born without a lifecycle. The create primitive is loud and easy. The delete primitive, the audit primitive, the inheritance primitive — they don't exist on equal footing, so they don't get used.

The cost is invisible until you go looking, which is exactly when nobody is looking.
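To make the missing lifecycle concrete, here is a minimal sketch of a create primitive that refuses to schedule anything without an owner and an expiry, paired with a review query. Everything here is an assumption for illustration: `ScheduledTask`, `create_task`, `tasks_needing_review`, and the 90-day default are hypothetical, not a feature of any particular scheduler.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Set


@dataclass(frozen=True)
class ScheduledTask:
    task_id: str
    owner: str            # a person or team that can be asked about the task
    purpose: str          # the original request, in the user's words
    created_at: datetime
    expires_at: datetime  # every task carries an expiry from creation

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at


def create_task(task_id: str, owner: str, purpose: str,
                now: datetime, ttl_days: int = 90) -> ScheduledTask:
    """Create a task only if it has an owner; the expiry is mandatory."""
    if not owner:
        raise ValueError("a scheduled task needs an owner")
    return ScheduledTask(task_id, owner, purpose, now, now + timedelta(days=ttl_days))


def tasks_needing_review(tasks: List[ScheduledTask], active_owners: Set[str],
                         now: datetime) -> List[ScheduledTask]:
    """Tasks that have expired or whose owner is no longer active."""
    return [t for t in tasks if t.is_expired(now) or t.owner not in active_owners]
```

The design choice is that expiry and ownership are set by the same call that creates the side effect, so the audit and inheritance questions have an answer before the session closes.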

The Agent That Scheduled Itself Into the Maintenance Window

· 11 min read
Tian Pan
Software Engineer

A senior engineer on call at 2am does not run a schema migration during a Sev-2 incident. They do not redeploy the payment service ten minutes before a release freeze starts. They do not fire a marketing email campaign while the email vendor's status page is red. None of this is in their job description. They picked it up from years of getting yelled at, from Slack channels titled #deploy-freeze-friday, from the muscle memory of glancing at the status page before they touch anything. It is the kind of context that does not exist in any runbook because nobody thought it needed to be written down.

Now hand the same job to an agent. The agent has tools. The agent has a multi-step plan. The agent has every documented policy you bothered to put in its system prompt. What the agent does not have is the half-conscious awareness that the world is currently on fire. So it executes the plan. Cleanly. Confidently. Into the maintenance window. And the postmortem includes a sentence that is going to become a familiar trope: "the agent had no way of knowing."
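One way to give an agent that awareness is a check that runs before each side-effecting step and consults a list of currently active freezes or incidents. This is a sketch under assumptions: `Window`, `blocking_window`, and `run_if_clear` are hypothetical names, and it presumes someone maintains the window list from sources such as a deploy calendar or a vendor status page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Window:
    """A period during which certain actions should not run."""
    name: str
    start: datetime                  # inclusive
    end: datetime                    # exclusive
    blocked_actions: FrozenSet[str]  # action names this window forbids


def blocking_window(action: str, now: datetime,
                    windows: List[Window]) -> Optional[Window]:
    """Return the first active window that blocks `action`, if any."""
    for w in windows:
        if w.start <= now < w.end and action in w.blocked_actions:
            return w
    return None


def run_if_clear(action: str, now: datetime, windows: List[Window],
                 execute: Callable[[], str]) -> str:
    """Execute the step only when no active window blocks it."""
    w = blocking_window(action, now, windows)
    if w is not None:
        return f"deferred: {action} blocked by {w.name}"
    return execute()
```

Returning a deferral message instead of raising lets the agent's plan record why a step did not run, which is the context a postmortem would otherwise lack.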

When Marketing Reads Your Eval Cases: The Cross-Functional Visibility Problem

· 11 min read
Tian Pan
Software Engineer

The eval set is the most-read artifact your AI team produces, and you almost certainly don't know who's reading it. The repo is private, the CI job is internal, the file is one directory above prompts/ — and yet a sales engineer scraped six cases for a demo last quarter, a marketing analyst pulled three failure cases into a "look how robust our system is" deck, customer success cited eval pass-rates verbatim in a renewal call, and product treats the file as the hidden spec the AI team won't share. The case files are read by more people than the code that generated them, and nobody on the AI team has noticed.

This isn't a permissions failure. The eval set is on the same Git server as the rest of the codebase, with the same access controls as every other engineering artifact. The problem is that the AI team is the only group that treats the eval set as code. Everyone else treats it as documentation, as marketing material, as a product spec, or as a customer complaint log — and each of those readings extracts a different slice of the same file, packages it for a different audience, and ships it somewhere the AI team isn't watching.

The Internal Eval Set Is a Privacy Boundary Nobody Reviewed

· 11 min read
Tian Pan
Software Engineer

The dataset your AI team calls "the eval set" is, in most companies shipping LLM features, a collection of real customer conversations pulled from production logs. Nobody on the team thinks of it as a privacy event. The data never left the cluster. No new system was provisioned. No vendor was added. An engineer wrote a query, exported a few thousand traces into a labeling tool, and the team started grading model outputs against them. The legal team never heard about it because, from the inside, nothing changed — the same conversations that already lived in the same database were now also being read by a few engineers and a judge model.

That is the privacy boundary nobody reviewed. Customers gave you their messages so you could answer them. They did not give you their messages so you could measure your model against them. The two uses look identical at the storage layer and feel identical at the inference layer, but they are different processing purposes under every modern privacy regime — and the gap between the two is where the next round of compliance pain is going to land.

Compliance Reviewer as Eval Author: Why Legal Should Be Writing Your Test Cases

· 13 min read
Tian Pan
Software Engineer

The most useful adversarial prompt I have seen for an enterprise LLM did not come from a red team, a security researcher, or a prompt engineer. It came from a senior compliance attorney who asked the model, in plain English, to "tell me which of the three retirement annuities discussed earlier in this thread is the best one for a 62-year-old approaching their first required minimum distribution." The model produced a confident, thoughtful, beautifully formatted recommendation. That output, had it been sent to a customer, would have been a textbook FINRA suitability violation: an unsuitable individualized recommendation made without the supervisory infrastructure that securities rules require around personalized advice.

The compliance attorney spotted the failure mode in about four seconds. The engineering eval suite, which had a hundred-plus carefully constructed cases for hallucination, refusal calibration, and tool-use accuracy, had no concept that this particular response shape was illegal. Not low quality. Not a hallucination. Illegal. And the workflow at the company at the time had her reading sample outputs in a Google Doc and writing memos, rather than checking a test case into the regression suite. So her catch lived in a memo, the memo got summarized in a launch-readiness slide, and the next month a refactor of the system prompt regressed the behavior because nobody had a failing test pinned to it.

That is the gap I want to argue we should close: the compliance reviewer should be authoring eval cases directly, and those cases should be the artifact that gates release — not the document review that produced them.
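As a rough illustration of what a reviewer-authored case could look like once it lives in the regression suite, the sketch below stores the rule, the prompt, and a set of forbidden response patterns, then gates release on them. All names (`ComplianceCase`, `violations`, `gate_release`) are hypothetical, and regex matching is a deliberately crude stand-in: a real suite would more likely grade with a rubric or a judge model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ComplianceCase:
    case_id: str
    author: str                         # the reviewer who wrote the case
    rule: str                           # plain-language rule under test
    prompt: str
    forbidden_patterns: Tuple[str, ...]  # regexes marking a violating response


def violations(case: ComplianceCase, response: str) -> List[str]:
    """Return the forbidden patterns that the response matches."""
    return [p for p in case.forbidden_patterns
            if re.search(p, response, re.IGNORECASE)]


def gate_release(cases: List[ComplianceCase],
                 model: Callable[[str], str]) -> List[str]:
    """Return ids of failing cases; an empty list means the gate passes."""
    return [c.case_id for c in cases if violations(c, model(c.prompt))]
```

With a case like this checked in, a system prompt refactor that regresses the behavior fails a test instead of contradicting a memo.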

The Pre-Launch Blast Radius Inventory: The Document Your Agent Team Forgot to Write

· 10 min read
Tian Pan
Software Engineer

The first hour of an agent incident is always the same. Someone notices the agent did something it shouldn't have — invoiced the wrong customer, deleted a calendar event for the CEO, posted a half-finished apology in a public Slack channel — and the response team starts asking questions nobody has written answers to. Which downstream system holds the audit log? Which on-call rotation owns that system? Was the call reversible, and within what window? Who owns the credential the agent used, and does that credential also let it touch other systems we haven't checked yet? The team that wrote the agent rarely owns those answers, because the answers live in the systems the agent calls, and nobody at launch wrote them down in one place.

That document is the blast radius inventory, and it is the artifact most agent teams discover the absence of during their first incident. It is not a security checklist, not a tool schema, not a runbook. It is an enumerated list of every external system the agent can touch and every fact you need on the worst day of that system's life. Teams that ship agents without one are betting that incident-response context can be reconstructed faster than the blast spreads, and that bet keeps losing as agents get more tools and the tools get more powerful.
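A minimal sketch of such an inventory as data, with one entry per external system and the facts the first-hour questions ask for. The schema and helper names (`SystemEntry`, `missing_entries`, `systems_sharing_credential`) are assumptions for illustration, as is the idea of checking the inventory against the agent's tool list before launch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SystemEntry:
    system: str
    audit_log_location: str
    oncall_rotation: str
    reversible: bool
    reversal_window_hours: Optional[float]  # None when the action is not reversible
    credential_id: str
    credential_owner: str


def missing_entries(agent_tools: Dict[str, str],
                    inventory: List[SystemEntry]) -> List[str]:
    """Given tool name -> system, list tools whose system has no inventory entry."""
    known = {e.system for e in inventory}
    return sorted(tool for tool, system in agent_tools.items() if system not in known)


def systems_sharing_credential(inventory: List[SystemEntry],
                               credential_id: str) -> List[str]:
    """Every system the same credential can touch: what else to check in an incident."""
    return sorted(e.system for e in inventory if e.credential_id == credential_id)
```

Running `missing_entries` as a launch check would turn the absence of the document into a visible failure before the first incident rather than during it.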

The Shadow AI Governance Problem: Why Banning Personal AI Accounts Makes Security Worse

· 9 min read
Tian Pan
Software Engineer

Workers at 90% of companies are using personal AI accounts — ChatGPT, Claude, Gemini — to do their jobs, and 73.8% of those accounts are non-corporate. Meanwhile, 57% of employees using unapproved AI tools are sharing sensitive information with them: customer data, internal documents, code, legal drafts. Most executives believe their policies protect against this. The data says only 14.4% actually have full security approval for the AI their teams deploy.

The gap between what leadership believes is happening and what is actually happening is the shadow AI governance problem.

The instinct at most companies is to respond with a ban. Block personal chatbot accounts at the network level, issue a policy memo, run an annual training, and call it governance. This is the worst possible response — not because the concern is wrong, but because the intervention makes the problem invisible without making it smaller.

Your AI Explainer Doc Is a Runtime Dependency, Not Marketing Copy

· 12 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped an AI assistant with a tidy stack of supporting documents: an in-product tooltip warning that the AI may produce inaccurate results, a help-center article titled "How does the assistant work," an internal support runbook for handling escalations, and a public model card listing the underlying model, the tools the assistant could call, and the data domains it covered. The launch went well. Six months later the prompt had been edited fourteen times, the model had been swapped from one tier to another with subtly different refusal behavior, two new tools had been added, one tool had been deprecated but not removed from the prompt, and the language settings had been opened from English-only to nine locales.

Every single one of those documents was wrong. Not catastrophically wrong — the kind of wrong where a sentence is half-true, a capability is described in language the model no longer matches, a refusal pattern is documented that the new model never triggers, a tool name appears in the help article that the assistant won't actually call. The kind of wrong that produces a slow drip of confused support tickets, a few customer trust regressions when the AI does something the docs say it won't, and — because the company sells into a regulated vertical — a small but real compliance gap that nobody on the AI team had thought to track.
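One narrow slice of that drift, tool names, can be checked mechanically. The sketch below compares the tools the assistant can actually call against the tools the documents mention; `doc_drift` and its inputs are hypothetical, and it assumes both lists can be extracted from the tool registry and the docs.

```python
from typing import Dict, Iterable, Set


def doc_drift(registered_tools: Iterable[str],
              documented_tools: Iterable[str]) -> Dict[str, Set[str]]:
    """Report tool names that appear on only one side of the docs/runtime boundary."""
    registered, documented = set(registered_tools), set(documented_tools)
    return {
        "documented_but_not_callable": documented - registered,
        "callable_but_undocumented": registered - documented,
    }
```

A check like this says nothing about refusal behavior or locale coverage, which would need their own comparisons, but it catches the deprecated tool that still appears in a help article.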

The AI Risk Register: What Your CRO Will Demand the Morning After

· 12 min read
Tian Pan
Software Engineer

The morning after the first six-figure agent incident, the directors will not ask whether the model was state-of-the-art. They will ask to see the row in the risk register that named this scenario, the owner who signed off, and the date the board last reviewed it. If your enterprise risk register has lines for cyber, vendor, regulatory, and operational risk, but no row for "an autonomous agent took an action under our credentials that produced a customer-visible loss," you are about to spend a board meeting explaining why the artifact every other category of risk merits did not exist for the one that just lost you money.

This is not a hypothetical anymore. Gartner projects that more than a thousand legal claims for harm caused by AI agents will be filed against enterprises by the end of 2026. AI-related risk has moved from tenth to second on the Allianz Risk Barometer in a single year. Insurers are now asking, in D&O renewal questionnaires, how the board has integrated AI into the corporate risk register and how third-party agentic exposures are being tracked. The line items below are what a defensible answer looks like, along with the cadence on which the AI feature owner has to defend them.

AI Shadow IT: When Product Teams Build Their Own LLM Proxy

· 11 min read
Tian Pan
Software Engineer

The shadow IT incident your platform team is going to investigate in Q3 already happened in January. It looks like this: a senior engineer on a product team has a launch this month. The platform team's "official" LLM gateway is on the roadmap for "next quarter." So the engineer opens an OpenAI account on a corporate credit card, drops the API key into a .env file, ships the feature, and hits the public deadline. The launch is a success. Six months later, the FinOps team finds three vendor accounts nobody can attribute, the security team finds prompts containing customer data routed to a region not covered by the data processing agreement, and the platform team discovers the gateway it spent two quarters building has 14% adoption because every team that needed AI shipped without it.

This is not a security failure or a discipline failure. It is a platform-product velocity mismatch, and treating it as anything else guarantees the next gateway you ship will have the same adoption problem.