Skip to main content

42 posts tagged with "governance"

View all tags

The AI Risk Register: What Your CRO Will Demand the Morning After

· 12 min read
Tian Pan
Software Engineer

The morning after the first six-figure agent incident, the directors will not ask whether the model was state-of-the-art. They will ask to see the row in the risk register that named this scenario, the owner who signed off, and the date the board last reviewed it. If your enterprise risk register has lines for cyber, vendor, regulatory, and operational risk, but no row for "an autonomous agent took an action under our credentials that produced a customer-visible loss," you are about to spend a board meeting explaining why the artifact every other category of risk merits did not exist for the one that just lost you money.

This is not a hypothetical anymore. Gartner projects that more than a thousand legal claims for harm caused by AI agents will be filed against enterprises by the end of 2026. AI-related risk has moved from tenth to second on the Allianz Risk Barometer in a single year. Insurers are now asking, in D&O renewal questionnaires, how the board has integrated AI into the corporate risk register and how third-party agentic exposures are being tracked. The line items below are what a defensible answer looks like, and the cadence the AI feature owner has to defend them on.

AI Shadow IT: When Product Teams Build Their Own LLM Proxy

· 11 min read
Tian Pan
Software Engineer

The shadow IT incident your platform team is going to investigate in Q3 already happened in January. It looks like this: a senior engineer on a product team has a launch this month. The platform team's "official" LLM gateway is on the roadmap for "next quarter." So the engineer creates a corporate credit card OpenAI account, drops the API key into a .env file, ships the feature, and hits the public deadline. The launch is a success. Six months later, the FinOps team finds three vendor accounts nobody can attribute, the security team finds prompts containing customer data routed to a region not covered by the data processing agreement, and the platform team discovers the gateway it spent two quarters building has 14% adoption because every team that needed AI shipped without it.

This is not a security failure or a discipline failure. It is a platform-product velocity mismatch, and treating it as anything else guarantees the next gateway you ship will have the same adoption problem.

The Customer Record Hiding in Your Few-Shot Prompt Template

· 11 min read
Tian Pan
Software Engineer

The privacy auditor's question came two days before the SOC 2 renewal: "Why is the email field in your onboarding prompt's example a real customer address?" The product team rebuilt the chain in their heads. A year earlier, when they shipped the AI summarizer, someone needed a "see how this works" example for the few-shot template. They picked a representative customer record from staging, scrubbed the obvious fields — name, account ID, phone — and committed the file. The customer churned six months later. Their record was deleted from the database per the data retention policy. Their record was not deleted from the prompt template, which had been shipped to every tenant in production.

The team had assumed, like most teams, that the privacy boundary was the database. The prompt template was code. Code goes through review. Review doesn't flag PII because reviewers aren't looking for it in YAML strings labeled example_input:. The DLP scanner that catches PII in Slack messages and email attachments doesn't scan committed code, and even if it did, it wouldn't recognize a partially-scrubbed customer record as personal data because the fields it knew to look for had been removed. Everything that remained — the company size, the industry, the rare job title, the specific city — was data the scanner had no rule for.

The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart

· 11 min read
Tian Pan
Software Engineer

A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer is still wrong.

This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.

The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.

AI Cyber Insurance: The Coverage Gap Your Agent Will Find First

· 11 min read
Tian Pan
Software Engineer

A coding agent merges a change at 2 a.m. that takes a customer's production database offline for ninety minutes. A customer-support agent fans out and sends fourteen thousand misworded refund-denial emails before the loop is killed. An autonomous reconciliation workflow charges 2,800 cards twice. The damages are real, the audit trail names your company, and your finance team files the claim against the cyber policy that was renewed six weeks ago. The carrier's response is a polite letter explaining that the policy covers "unauthorized access by malicious third parties" and "social engineering of an employee" — and the agent was authenticated, the action was authorized, and no employee was deceived. Coverage denied. The loss sits on your balance sheet.

This is not a hypothetical edge case. It is the modal claim profile for the next eighteen months, and the insurance industry knows it. Cyber, E&O, and D&O policy language was calibrated against a threat model where breach severity is a function of records exfiltrated and incident response is a function of forensic hours billed. Agentic AI does not produce that shape of incident. It produces a shape the underwriter has no actuarial baseline for, and the carrier's first instinct — when the actuarial baseline is missing — is to write the exposure out of the policy entirely.

Your Agent Has Two Release Pipelines, Not One

· 10 min read
Tian Pan
Software Engineer

A team I worked with shipped a "small prompt tweak" on a Wednesday afternoon. The same PR also added one new tool to the agent's registry — a convenience wrapper around an internal admin API that the prompt would now occasionally invoke. The eval suite passed. The canary looked clean. By Thursday morning a customer's billing record had been mutated by an agent acting on a prompt-injected support ticket, the audit trail showed the admin tool firing exactly as designed, and the on-call engineer's first instinct — roll back the prompt — did nothing useful, because the credential had already been used and the row had already been written.

The post-mortem framed it as a security review failure. It wasn't. It was a release-pipeline failure. The team had shipped two completely different asset classes — a behavioral nudge to the model and a new authority granted to the agent — through the same review, the same gate, and the same rollback story, as if they were the same kind of change. They aren't. And once you see them as two pipelines, most "agent governance" debates become much less mysterious.

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

· 11 min read
Tian Pan
Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.

Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It

· 11 min read
Tian Pan
Software Engineer

A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.

This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec — it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criteria is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.

Found Capabilities: When Users Ship Features Your Team Never Roadmapped

· 10 min read
Tian Pan
Software Engineer

A customer emails support to ask why your CRM agent stopped drafting their NDAs. You did not know your CRM agent drafted NDAs. A power user complains that your support bot's Tagalog translations have gotten worse since last week. You did not know your support bot did Tagalog. A forum thread spreads a prompt that turns your code-review assistant into a passable security scanner, and within a quarter you are getting CVE reports filed against findings the assistant produced. Each of these is a feature with adoption, business impact, and zero institutional ownership — no eval, no SLA, no surface in the UX, no roadmap entry, and a quiet bus factor of one: the customer who figured it out.

This is what happens once your product is wrapped around a model whose capability surface is wider than the surface you scoped. Users explore the wider surface, find behaviors that solve their problems, build workflows on top of those behaviors, and then experience your next model upgrade as a regression even though nothing on your roadmap moved. The contract between you and your users is no longer the one you wrote down. It includes everything the model happened to do for them that you happened not to break.

Treating this as an engineering surprise — "we will harden the prompt, we will add a guardrail, we will catch it next time" — is a category error. Found capabilities are a product-management problem. The discipline is not preventing them; it is detecting them, deciding what to do with them, and remembering that you decided.

The Internal LLM Gateway Is the New Service Mesh

· 10 min read
Tian Pan
Software Engineer

Walk into any company with fifty engineers writing LLM code in production and you will find seven gateway-shaped artifacts. The recommendations team built one to route between OpenAI and Anthropic. The support-bot team wrote one to attach their prompt registry. The platform team has a half-finished proxy that handles auth but not rate limiting. The growth team has a Lambda that does PII redaction on its way out. The data-science team is calling the vendor SDK directly and nobody has told them to stop. There is no shared gateway. There are seven shared problems, each solved poorly in isolation, and a CFO who is about to ask why the AI bill grew 40% quarter over quarter with no clear owner for any of it.

This is the same architectural beat the industry hit with microservices in 2016 and 2017. A thousand external dependencies, the same shared concerns at every team — auth, retries, observability, policy — and a choice between solving them once or rediscovering them everywhere. The answer then was the service mesh. The answer now is the internal LLM gateway, and most companies are still in the rediscovering-everywhere phase.

Prompt-Eligibility: The Missing Column in Your Data Classification

· 11 min read
Tian Pan
Software Engineer

Pull up your company's data classification policy. Public, internal, confidential, restricted — four neat tiers, each mapped to a set of access controls and a list of approved storage locations. Now ask a question the policy was never written to answer: which of these tiers are allowed to leave the corporate perimeter as a token sequence sent to a third-party model API?

The answer is almost always silence. Not because the policy is wrong, but because it is incomplete. Every classification scheme in use today was designed for an access vector that asks "is this employee allowed to read this row?" The prompt layer introduced a different vector entirely: an authorized service reads the row, transforms it into a prompt, and ships it across the network to a vendor that may log it, train on it, or hold it in plaintext for thirty days. None of that is read-access. None of it is covered.

This is the missing column. Until you add it, your data classification document is confidently asserting a control posture you do not have.

Shadow MCP: The Tool Servers Your Security Team Has Never Heard Of Are Already Running on Your Engineers' Laptops

· 13 min read
Tian Pan
Software Engineer

Your security team has a complete inventory of every SaaS subscription on the corporate card, every OAuth app with admin consent, every device on the corporate Wi-Fi. They have zero visibility into the seven processes bound to 127.0.0.1 on your senior engineer's laptop right now — a "deploy assistant" with a long-lived staging API token, a "ticket triager" subscribed to a customer-data Slack channel, a "release notes generator" with read access to the production analytics warehouse. None of it is on a vendor list. None of it shows up in the SSO logs. All of it is running on credentials the engineer already had, doing things nobody approved them to do.

This is shadow MCP, and it is the fastest-growing unmanaged authorization surface in the enterprise. The Model Context Protocol made it trivially cheap to wire any tool into any LLM, and engineers — being engineers — wired the obvious things first. Saviynt's CISO AI Risk Report puts the number at 75% of CISOs who have already discovered unsanctioned AI tools running in their production environments. The GitHub MCP server crossed two million weekly installs in early 2026. The Postgres MCP server, which gives an LLM a SQL prompt against any database the developer can reach, is north of 800,000 weekly installs. None of those numbers represent enterprise IT decisions.