Skip to main content

3 posts tagged with "data-residency"

View all tags

Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't

· 9 min read
Tian Pan
Software Engineer

The team that ships "AI for EU customers" usually ships exactly one residency control: an inference endpoint pinned to an EU region. The procurement team gets a DPA, the architecture diagram gets a green checkmark next to "model hosted in Frankfurt," and the launch proceeds. What the diagram doesn't show is that the customer's verbatim query gets vectorized by a US-hosted embedding API on its way to the model, that the vector store the query is matched against has its operational plane in us-east-1, that the rerank model is a third-party SaaS deployed wherever the vendor chose, that the prompt cache is keyed regionally on hits and globally on misses, and that the trace store logging the retrieved chunks has a 30-day retention bucket that replicates cross-region for redundancy.

The inference layer respects residency. The retrieval pipeline doesn't even know it's a participant.

This is the gap where most "GDPR-compliant" RAG deployments fail an audit the team didn't realize was coming. The fix isn't another control on the model call — it's recognizing that data residency is a property of every component the customer's bytes touch, and that the team owning "the LLM" owns at most one of the six surfaces involved.

The Inference Region Your Data Residency Policy Forgot to Pin

· 9 min read
Tian Pan
Software Engineer

The compliance audit always starts with the same question and your team always answers it the same way. "Where is customer data processed?" In the EU region, the slide deck says, and the SDK config screenshot confirms it, and the DPA promises it. Then the auditor pulls a sample of last quarter's request logs, joins them to the provider's per-request region header, and the room gets quiet. Something like four percent of EU enterprise prompts were served by a US-region inference node during a forty-minute capacity event the team did not know happened. The cache that holds reusable prefixes was in the global pool. The trace store the support team queries is in us-east. The DPA was a slide deck. The contract was a routing hint.

This is the kind of incident that does not show up in a postmortem because no service degraded. The model returned an answer, the user got a response, the latency graph stayed flat. The thing that broke is a thing the dashboards were never wired to see: the geographic path of the request through the provider's infrastructure. Engineers who would never confuse a us-east-1 URL with "the request actually executed in us-east-1" routinely make that exact mistake at the LLM API layer, because the provider's region parameter looks like the AWS one, behaves like the AWS one in the happy path, and silently degrades to "best effort" the moment the preferred region runs out of GPU.

Multi-Region AI Deployment: Data Residency, Model Parity, and the Latency Tax Nobody Budgets

· 10 min read
Tian Pan
Software Engineer

When engineers budget for multi-region AI deployments, they typically account for two variables: infrastructure cost per region and replication overhead. What they consistently underestimate — sometimes catastrophically — are three costs that only appear once you're live: model parity gaps that make your EU cluster produce different outputs than your US cluster, KV cache isolation penalties that make every token in GDPR territory more expensive to generate, and silent compliance violations that trigger when your retry logic routes a French user's data through Virginia.

A German bank spent 14 months deploying a large open-source model on-premises to satisfy GDPR requirements. That's not unusual. What's unusual is that the engineers who proposed the architecture understood the compliance constraint upfront. Most don't until an incident report forces the conversation.