Skip to main content

60 posts tagged with "compliance"

View all tags

The Agent Memory Store That Survived Your Tenant Deletion Because Nobody Owned It

· 10 min read
Tian Pan
Software Engineer

A compliance program is a description of the systems your company had on the day the auditor signed off. The systems your company has today are a different set, and the gap is the surface area of every release that shipped a new persistent store between then and now. The deletion guarantee you sold your customers is a guarantee against the first set, and the regulator who eventually asks about it will be asking about the second.

The failure mode is not a bug in the deletion code. The deletion code is correct. The saga fans out across every storage system named in the data inventory, calls each one's erasure endpoint, collects a receipt per system, and reports success when every receipt comes back signed. The saga is doing exactly what it was built to do. The problem is that the saga is iterating over a list of storage systems that was true eighteen months ago, and the agent platform team shipped a long-term memory feature six months ago that nobody added to the list.

The Citation URL That Resolved But No Longer Said What the Model Quoted

· 10 min read
Tian Pan
Software Engineer

A RAG agent answers a customer's regulatory question with a tidy paragraph and a citation. The verification layer fetches the URL, sees a 200 OK, ticks the box, and ships. Six months later a compliance audit pulls the transcript, clicks the same link, and finds a page that now says the opposite of what the agent quoted. The URL is fine. The quote is fine in the transcript. The two no longer match. The customer's compliance officer asks whether the agent fabricated the quote, and the team cannot prove it didn't, because the only surviving evidence of what the URL used to say is the agent's own assertion of what it said.

This is not a hallucination in the usual sense. The model retrieved real content, faithfully extracted a real sentence, and emitted a real URL that still resolves. Every link-checker on earth would call this citation valid. The audit fails anyway, because the verification layer was measuring the wrong property. Reachability is not fidelity. A URL is a pointer to a mutable document under someone else's editorial control, and the moment the document changes, every transcript that quoted it becomes a hallucination report waiting to happen.

The Data Residency Contract Your Provider Honored at the API Boundary and Broke at the Cache

· 9 min read
Tian Pan
Software Engineer

Your residency audit traced every outbound request from the tenant's traffic, watched it terminate on a hostname in Frankfurt, and signed off. The audit was correct about everything it measured. It was also looking at the wrong layer. The request went to the EU. The bytes that satisfied the request — the cached prefix the provider hashed and pulled from the nearest available node — lived in us-east-1. Your regional endpoint promised you a destination. The cache promised nothing, because the cache was a different product, governed by a different SLA, designed for cost rather than for compliance.

The customer's auditor caught it. Not yours. A different vendor's incident report mentioned that prompt cache placement was decoupled from inference region, and the customer's GRC team asked the obvious follow-up question: where do our prefixes go? The contract amendment to close the gap took ninety days. The renewal got suspended. The team that wrote the integration had done nothing wrong by the documentation they were handed.

The Legal Disclaimer That Leaked From The Answer Into The Tool Call Arguments

· 9 min read
Tian Pan
Software Engineer

Your counsel approved a one-line system-prompt directive: append "This information is not legal advice and should not be relied upon as such" to every response touching a regulated domain. Three weeks later, a user files a bug because their calendar event's description field opens with that same line, followed by a contract summary the agent was supposed to put into a meeting invite. The agent did not malfunction. It did exactly what the system prompt told it to do, which turned out to be a behavior that ranges over every channel the model produces text into — including the JSON arguments of the next tool it called.

The instruction was a content-formatting rule and the model treated it as one. It did not distinguish "user-facing response" from "tool call argument" because nothing in the prompt told it those were different surfaces. The disclaimer ended up in the calendar, in the email draft, in the Slack message your agent posted on the user's behalf. Each of these was a separate downstream system whose author had no idea a compliance string was about to be injected into a structured field, and each had a different cleanup cost.

Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't

· 9 min read
Tian Pan
Software Engineer

The team that ships "AI for EU customers" usually ships exactly one residency control: an inference endpoint pinned to an EU region. The procurement team gets a DPA, the architecture diagram gets a green checkmark next to "model hosted in Frankfurt," and the launch proceeds. What the diagram doesn't show is that the customer's verbatim query gets vectorized by a US-hosted embedding API on its way to the model, that the vector store the query is matched against has its operational plane in us-east-1, that the rerank model is a third-party SaaS deployed wherever the vendor chose, that the prompt cache is keyed regionally on hits and globally on misses, and that the trace store logging the retrieved chunks has a 30-day retention bucket that replicates cross-region for redundancy.

The inference layer respects residency. The retrieval pipeline doesn't even know it's a participant.

This is the gap where most "GDPR-compliant" RAG deployments fail an audit the team didn't realize was coming. The fix isn't another control on the model call — it's recognizing that data residency is a property of every component the customer's bytes touch, and that the team owning "the LLM" owns at most one of the six surfaces involved.

The Chain-of-Thought You Stripped to Save Tokens That Hid an Evidence Requirement

· 10 min read
Tian Pan
Software Engineer

A platform team shipped a prompt refactor that cut average response cost by thirty-two percent. The change was simple: strip the "explain your reasoning" preamble, ask the model to return only the JSON object, and drop the post-processing step that parsed the rationale out of the model's prose. The dashboard turned green. The unit economics page in the quarterly review went from yellow to gold. Nobody on the platform team thought to consult the risk team, because no part of the change touched the answer the customer received.

Two quarters later, a regulated customer's auditor requested the decision rationale for a denied-loan letter from a date six months prior. The team pulled the trace. The input was there. The output was there. The reasoning was gone — not because anyone deleted it, but because it had stopped being produced the day the refactor shipped. The customer's compliance program had been operating on the assumption that the rationale was somewhere in the trace store; the platform team had been operating on the assumption that the rationale was nobody's problem because the customer-facing answer was unchanged. Both assumptions were correct in isolation. Together they cost the customer a regulatory finding and the platform team a contract renewal.

The Dataset License That Retroactively Poisoned Your Fine-Tune

· 10 min read
Tian Pan
Software Engineer

The fine-tuned checkpoint that has been running in production for nine months is now sitting in a Slack thread between your CTO and outside counsel. A data source that you scraped under what looked like a permissive license has changed its terms, sent a notice, and named your model. Your engineers want to know whether the model can simply be "untrained" on the offending records. Counsel wants to know whether the weights file itself is now a regulated artifact. Nobody on the call has a good answer, because your training pipeline treated the license as an event — read once at ingestion time — instead of a state that the world can edit after you have already paid for the H100s.

This is the failure mode that very few fine-tuning playbooks bother to discuss. The license under which a dataset was distributed is not a static gate that you walk through at ingestion. It is an ongoing claim by a third party that you do not control, and the half-life of that claim is shrinking. Hugging Face's own legal repository quietly logs DMCA takedowns against named datasets every few weeks — AoPS pulling the MATH benchmark, PaperDemon pulling scraped artwork, Archive of Our Own removing a fanfiction dump within hours of notice. Each takedown is a downstream signal that some model somewhere was trained on data whose redistribution rights have since evaporated.

The Inference Region Your Data Residency Policy Forgot to Pin

· 9 min read
Tian Pan
Software Engineer

The compliance audit always starts with the same question and your team always answers it the same way. "Where is customer data processed?" In the EU region, the slide deck says, and the SDK config screenshot confirms it, and the DPA promises it. Then the auditor pulls a sample of last quarter's request logs, joins them to the provider's per-request region header, and the room gets quiet. Something like four percent of EU enterprise prompts were served by a US-region inference node during a forty-minute capacity event the team did not know happened. The cache that holds reusable prefixes was in the global pool. The trace store the support team queries is in us-east. The DPA was a slide deck. The contract was a routing hint.

This is the kind of incident that does not show up in a postmortem because no service degraded. The model returned an answer, the user got a response, the latency graph stayed flat. The thing that broke is a thing the dashboards were never wired to see: the geographic path of the request through the provider's infrastructure. Engineers who would never confuse a us-east-1 URL with "the request actually executed in us-east-1" routinely make that exact mistake at the LLM API layer, because the provider's region parameter looks like the AWS one, behaves like the AWS one in the happy path, and silently degrades to "best effort" the moment the preferred region runs out of GPU.

The Retention Policy That Erased Context Your Model Was Still Reading

· 12 min read
Tian Pan
Software Engineer

A nightly retention worker deletes any user message older than thirty days. A long-running enterprise support session, opened in early March, is still active in late May. On the request that comes in at turn 41, your prompt assembler reads from the same messages table the retention worker has been quietly pruning. Turns 1 through 28 are gone. The model receives a conversation that starts at turn 29 with no signal that earlier turns ever existed. The user asks "what was the SLA we agreed on earlier?" and the model confidently invents a number, because the actual answer was in turn 4 — which the retention worker erased the night before.

This is not a model failure. The model did exactly what it was supposed to: produce a plausible answer from the context it was handed. The failure happened upstream, in the gap between two teams that each thought they owned the messages table.

The Evidence Locker Your Agent Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Your trace logs every token. They log every tool call, every retry, every retrieval latency, every model id. They look exhaustive. Then a regulator, a customer, or your own incident channel asks the one question that should be easy: what did the model actually see at the moment it decided? And you discover that your trace recorded the questions but not the answers the model was looking at when it answered.

The retrieved chunks have rotated out of the vector store because the corpus was reindexed last Tuesday. The tool response was a streamed payload you stored only the final-state summary of, because storing the full stream tripled your bill. The system prompt was assembled at runtime from a feature flag that has since flipped twice, and your flag service does not retain historical values by timestamp. You have full observability over what happened — the call graph, the token counts, the latencies. You have nothing about what the model was answering against. That gap is the difference between a trace and a decision record, and most teams have not noticed they only built one of the two.

Your AI Disclosure Disappeared by Turn Three and Nobody Noticed Until the Regulator Did

· 11 min read
Tian Pan
Software Engineer

Your legal team spent four meetings negotiating the exact disclosure sentence. Engineering put it at the top of the system prompt. QA confirmed it appears in turn one of every session. Three months later a regulator forwards a transcript: turn fourteen of a complaint-handling conversation, an hour of substantive guidance about a refund dispute, and nowhere in those fourteen turns does the user see the words "I am an AI." The disclosure your single-turn compliance review approved is structurally incapable of surviving the conversations that need it.

This is disclosure decay, and it is the multi-turn agentic failure mode that the wave of 2025–2026 chatbot regulation was not designed to catch and your QA process is not configured to test for. The EU AI Act's Article 50 obligations become enforceable on August 2, 2026, with fines up to €35 million or 7% of global turnover. California's SB 243 took effect January 1, 2026, with a private right of action that lets consumers sue directly for at least $1,000 per violation. Washington requires recurring disclosures, with hourly cadences for minors. None of these regimes were written assuming the disclosure would silently drop out of a session after the third tool call — but that is what your runtime is doing right now, on every long-running conversation, in production.

Your Agent's Audit Log Records Everything Except the Reason

· 11 min read
Tian Pan
Software Engineer

Compliance forwards you a ticket. A customer was denied a refund by your support agent three weeks ago, they have escalated, and now someone needs to explain the decision. You feel calm about this, because you instrumented everything. Every prompt, every tool call, every retrieved chunk, every token count, every latency number — it is all in the trace, and you can pull it up in seconds.

You pull it up. You can see the agent received the refund request. You can see it called get_order_history, then check_return_window, then lookup_policy. You can see the exact policy text it retrieved. You can see the final message it sent: refund denied. The trace is complete. Every span is green. And you still cannot answer the question, because the trace shows you that the agent denied the refund and shows you everything it looked at, but it does not show you why those inputs added up to no. The reason lived in how the model weighed the context, and that weighing was never an artifact. It was never written down anywhere.

This is the gap between a trace and an explanation, and almost every team that says "we have full observability" has not noticed they only built the first half.