The Third Copy: Vector Stores, Deletion Completeness, and the GDPR Gap RAG Teams Keep Missing
A user files a deletion request under GDPR Article 17. Your team kills the row in Postgres, deletes the document object from S3, and purges the cached PDFs from the CDN. Done. Privacy team signs off, security team signs off, the ticket closes. Six months later, an analytics engineer with read access to the vector index pulls a sample of float[1536] arrays for a clustering experiment, runs them through a publicly available inversion model, and reconstructs roughly nine in ten of the original 32-token chunks — including the documents you "deleted." Nobody planned this. Nobody is doing anything malicious. The pipeline just worked exactly as designed, against a threat model that never included the vector store as a copy of the data.
The mental error is the same in almost every RAG team I've seen: embeddings get treated as opaque numerical artifacts — derivatives, not data. Security reviews approve the launch because "embeddings aren't PII." Privacy reviews approve deletion handling because "the source text is gone." Both teams are wrong, and neither modeled the vector store as the third copy of the user's data — sitting next to the source database and the analytics warehouse, queryable by anyone with index read access, and outside the scope of every DLP scanner because nothing recognizes a 1536-dimensional float vector as sensitive.
This post is not another walkthrough of "your RAG needs ACLs." That post has been written. This is about the deletion side: what the inversion-attack literature actually proves, why GDPR/CCPA deletion handling falls short of "erasure" when embeddings persist, and the operational disciplines that turn the vector store from a forgotten backdoor into a properly governed copy of user data.
What Inversion Attacks Actually Recover
The high-water-mark result is from the vec2text line of work: an iterative-refinement decoder applied to OpenAI-style sentence embeddings recovers 92% of 32-token inputs exactly, and reaches BLEU scores in the high nineties. The attack does not need the encoder's weights, only black-box query access to the embedding API. Follow-on work (transferable inversion, ALGEN-style few-shot alignment) has shown that an attacker can train a surrogate inverter against a different embedding model on as few as a thousand paired samples and then transfer it to your production embedding API.
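To make the exposure concrete, here is a minimal sketch of running that inversion with the public vec2text library. The function names follow the project's README as of this writing, but treat the exact signatures as an assumption; the OpenAI call and the example text are purely illustrative.

```python
# Minimal embedding-inversion sketch using the public vec2text library
# (https://github.com/jxmorris12/vec2text). Names follow its README;
# exact signatures may have changed.
import torch
import vec2text
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> torch.Tensor:
    # Embed with the same model the pre-trained corrector targets.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return torch.tensor([d.embedding for d in resp.data])

# A pre-trained "corrector" iteratively refines a hypothesis until its
# re-embedding matches the target vector.
corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

stolen = embed(["Patient reports chest pain radiating to the left arm."])
recovered = vec2text.invert_embeddings(
    embeddings=stolen,
    corrector=corrector,
    num_steps=20,  # more refinement steps, higher exact-match rate
)
print(recovered)  # frequently identical to the source text at this length
```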
The practical reading: if you embed a user's prompt, support-ticket message, doctor's note, or contract clause, and store the resulting vector in a place where someone can later read it, you are storing a lossy but recoverable copy of the source text. "Lossy" is not "encrypted" and it is not "anonymized." Membership-inference and attribute-inference attacks need even less than full reconstruction — they only need to confirm whether a specific record was in the index, or to recover the demographic class of the source. The bar those attacks have to clear is correspondingly lower.
OWASP's LLM08:2025 entry formalized this category — vector and embedding weaknesses — alongside data poisoning and cross-context leaks. It's no longer a niche academic concern. It is a top-ten generative-AI risk class with documented attack code on GitHub.
Why GDPR Deletion Falls Through the Cracks
GDPR Article 17 ("right to erasure") and CCPA's deletion right operate on a definition of "personal data" that includes derivatives — anything from which an individual can be identified, directly or indirectly. The inversion literature is exactly the argument that embeddings of user content fall under that definition. But the operational pipelines almost never reflect this.
The standard deletion pipeline looks like:
- User submits the request.
- A worker deletes the user's rows from the primary database.
- A second worker invalidates caches and purges file storage.
- A third worker scrubs derived analytics and event logs against a retention window.
The vector store is rarely on that list. Even when it is, the integration is usually a delete_by_id(vector_id) call against a vector that was inserted at ingestion time — and it relies on the embedding service having produced a stable ID mapping between source documents and vectors. If the user's content was chunked across ten vectors, two were re-embedded after a model upgrade with new IDs, and one was copied into a backup index for offline evaluation, the deletion call removes maybe seven of the ten.
The OWASP AI Security Verification Standard's Memory and Embeddings control (C08) calls this out specifically: vector stores must support tombstoning and hard deletes such that revoked vectors cannot be recovered or re-indexed. AWS's Bedrock Knowledge Bases blog post on "right to be forgotten" explicitly describes the cascading deletion challenge: you have to identify every place a chunk landed (primary index, replicas, backups, evaluation snapshots, fine-tuning datasets if any) and remove it from all of them, and then re-embed any documents that legitimately remained but were chunked together with the deleted content.
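Here is what a complete cascade looks like as code. Everything in this sketch is a hypothetical stand-in: the ledger, the index clients, and the helper names are assumptions about your own ingestion metadata, not any vendor's API.

```python
# Sketch of a cascading erasure worker, assuming an ingestion ledger that
# records every index each chunk ever landed in. All clients are hypothetical.
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    vector_id: str
    index_name: str              # primary, replica, backup, eval snapshot
    neighbor_doc_ids: list[str]  # docs chunked together with this content

def erase_user_embeddings(user_id: str, ledger, indexes: dict) -> None:
    # 1. Resolve every vector the user's content ever produced, including
    #    IDs minted by re-embedding runs and copies made for evaluation.
    records: list[ChunkRecord] = ledger.vectors_for_user(user_id)

    # 2. Delete from every index the ledger names, not just the primary.
    #    The eval snapshot is the classic miss.
    for rec in records:
        indexes[rec.index_name].delete(ids=[rec.vector_id])

    # 3. Re-embed documents that legitimately remain but were chunked
    #    together with the deleted content.
    touched = {d for rec in records for d in rec.neighbor_doc_ids}
    for doc_id in touched:
        ledger.enqueue_reembedding(doc_id)

    # 4. Record the erasure so an audit can prove the cascade ran.
    ledger.mark_erased(user_id, vector_ids=[r.vector_id for r in records])
```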
The brutal part: as of this writing, no major commercial vector database offers a provable deletion mechanism. You get a deletion API, you get audit logs, but you do not get a cryptographic guarantee that the vector is unrecoverable from disk pages, replication streams, or backup tapes. Compliance teams that have pushed for that guarantee from cloud DBs hit a wall: deletion is a best-effort operation against a substrate designed for fast nearest-neighbor search, not for forensic erasure.
The Third Copy
Here is the threat-model frame that helps. In any RAG pipeline that stores embeddings of user content, the user's data exists in (at least) three places:
- The source — Postgres row, S3 object, CRM record. Deletion is well-understood.
- The cache layer — CDN, Redis, browser. Deletion is somewhat understood, with TTL machinery.
- The vector index — Pinecone, Weaviate, Qdrant, pgvector, OpenSearch. Deletion is rarely well-understood.
Each copy has its own access-control model, its own backup story, its own retention policy, and its own breach-disclosure footprint. The third copy is the one most often inherited from a "this is just an index, not a database" mental model — and that's where the security review goes wrong.
The corrective is not technically complicated; it's organizational. Treat the vector index as a tier-1 production data store from day one:
- It is in scope for breach-disclosure laws if compromised.
- It is in scope for access audit (who has read; who has list).
- It is in scope for tenant-isolation requirements (logical partitioning, per-tenant encryption keys, no shared namespaces between paying customers).
- It is in scope for the right-to-erasure pipeline, with cascading deletes that touch primary, replica, backup, and any derived indices.
Most teams already have these disciplines for their relational databases. Lifting them onto the vector store is paperwork, mostly — the hard work was building the discipline the first time.
Operational Practices That Actually Reduce Blast Radius
The threat-model framing is necessary; it isn't sufficient. There are five operational practices worth adopting that do not show up on the typical RAG launch checklist.
Per-tenant encryption of vectors at rest. "Encryption at rest" with a single key shared across the whole index is bookkeeping, not isolation: a backup snapshot of the index is a leak of every tenant's vectors. Per-tenant keys, with envelope encryption and key-rotation policies tied to your existing KMS, give you the ability to revoke a tenant's key and render their vectors unreadable in backups — a much stronger guarantee than delete_by_id.
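A minimal sketch of the envelope pattern, assuming AWS KMS via boto3 and the cryptography library; the per-tenant key-alias convention is an assumption, not a prescribed layout. Note that this protects the stored chunk payload and any exported copies; making the searchable vector itself unreadable is the job of the distance-preserving schemes discussed next.

```python
# Envelope encryption of a chunk payload under a per-tenant KMS key.
# The boto3 KMS calls are real; the alias naming scheme is hypothetical.
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")

def encrypt_payload(tenant_id: str, chunk_text: str) -> dict:
    # Fresh data key per record, wrapped by the tenant's master key.
    dk = kms.generate_data_key(
        KeyId=f"alias/vectors-{tenant_id}",  # hypothetical alias convention
        KeySpec="AES_256",
    )
    nonce = os.urandom(12)
    ct = AESGCM(dk["Plaintext"]).encrypt(nonce, chunk_text.encode(), tenant_id.encode())
    # Only the wrapped key is stored. Schedule the tenant's KMS key for
    # deletion and every copy of this record, backups included, goes dark.
    return {"wrapped_key": dk["CiphertextBlob"], "nonce": nonce, "ct": ct}

def decrypt_payload(tenant_id: str, rec: dict) -> str:
    key = kms.decrypt(CiphertextBlob=rec["wrapped_key"])["Plaintext"]
    return AESGCM(key).decrypt(rec["nonce"], rec["ct"], tenant_id.encode()).decode()
```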
Application-layer encryption with distance-preserving schemes. This is the path the cryptographic community has been developing for vector data — approximate distance-preserving encryption that lets you do nearest-neighbor search over encrypted vectors without exposing the cleartext to the index. The trade-off is real (slower indexing, recall hit on the order of 1–3% in published benchmarks), but it transforms the inversion-attack threat: an attacker who steals an encrypted vector gets a vector that inverts to noise. For high-sensitivity domains (health, legal, finance), the recall trade is often the right call.
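A toy illustration of the underlying idea: a secret orthogonal rotation exactly preserves Euclidean and cosine distances, so nearest-neighbor search is unaffected, while the vectors no longer live in the coordinate system the public inverters were trained on. This is a teaching sketch only; a bare rotation is a linear map and falls to known-plaintext attacks, which is why the published schemes add noise and approximation (and pay the recall hit mentioned above).

```python
# Toy distance-preserving transform: a secret random rotation.
# NOT a vetted scheme -- a bare rotation is breakable with known
# plaintext/ciphertext pairs. Real schemes add noise on top.
import numpy as np

rng = np.random.default_rng(seed=42)  # in production: derive from a managed key
dim = 1536

# QR decomposition of a random Gaussian matrix yields an orthogonal Q.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

def protect(v: np.ndarray) -> np.ndarray:
    return Q @ v  # apply before inserting into the index

def unprotect(v: np.ndarray) -> np.ndarray:
    return Q.T @ v  # orthogonal matrix: the inverse is the transpose

a, b = rng.normal(size=dim), rng.normal(size=dim)
# Distances survive the transform to floating-point precision, so ANN
# search over protected vectors returns the same neighbors.
assert np.isclose(np.linalg.norm(a - b), np.linalg.norm(protect(a) - protect(b)))
```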
Query-vector logging policy. This one is overlooked. When a user sends a query, your retrieval layer embeds it and runs the search. If you log the query for debugging or analytics, you have just stored another embedding of user content — often in a system (your APM, your data warehouse) with weaker controls than the primary vector store. The fix: hash query vectors before logging, or store only metadata (timestamp, latency, top-k IDs) rather than the vector itself. Otherwise the analytics layer becomes the easiest exfiltration path.
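A sketch of the safer logging shape, hashing the query text rather than the raw floats (more stable across re-embeddings); the salt handling and field names are illustrative.

```python
# Log retrieval telemetry without persisting another copy of user content.
import hashlib
import json
import logging
import time

log = logging.getLogger("retrieval")

def log_query(query_text: str, top_k_ids: list[str], latency_ms: float) -> None:
    # A salted hash supports correlating repeated queries without storing
    # the text or the (invertible) query vector. Salt shown inline for
    # brevity; in practice pull it from your secret manager.
    digest = hashlib.sha256(b"per-deployment-salt" + query_text.encode()).hexdigest()
    log.info(json.dumps({
        "ts": time.time(),
        "query_hash": digest[:16],  # enough for correlation, useless for recovery
        "top_k_ids": top_k_ids,     # IDs only, never the vectors themselves
        "latency_ms": latency_ms,
    }))
```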
Periodic re-embedding to invalidate old artifacts. Embedding models change. Every time you migrate to a new model, the old vectors become orphans — and if you keep them around in cold storage "in case we need to roll back," you've extended the retention horizon of every user's embedded content past whatever was disclosed in your privacy notice. A re-embedding rotation that purges the prior generation on a schedule (typically tied to the privacy notice's stated retention window) closes that gap.
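One way to make the purge explicit: tag every vector with its model generation, and end each rotation by hard-deleting the previous one. The index client here is a hypothetical stand-in; most major vector databases expose some form of metadata-filtered delete, but the call shape below is an assumption.

```python
# Generation-tagged re-embedding rotation. The index/ledger clients are
# hypothetical; the pattern -- tag, re-embed, purge -- is the point.
CURRENT_GEN = "emb-v4"    # illustrative generation tags
PREVIOUS_GEN = "emb-v3"

def rotate_generation(index, ledger, embed_fn) -> None:
    # 1. Re-embed every live document under the new model generation.
    for doc in ledger.live_documents():
        for chunk in doc.chunks():
            index.upsert(
                id=f"{chunk.id}:{CURRENT_GEN}",
                vector=embed_fn(chunk.text),
                metadata={"generation": CURRENT_GEN, "doc_id": doc.id},
            )

    # 2. Hard-delete the prior generation everywhere: primary, replicas,
    #    and any cold-storage export. Keeping it "for rollback" silently
    #    extends retention past what the privacy notice disclosed.
    index.delete(filter={"generation": PREVIOUS_GEN})
    ledger.record_purge(generation=PREVIOUS_GEN)
```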
Read-access audit on vector indexes the same way you audit production databases. This is the single highest-leverage discipline and the one most often missing. Most vector databases ship with a permissions model where every member of the data team has read access to every namespace because "we need it for evals." If your relational database with the same data would not allow that, the vector store should not either. The remediation is uncomfortable — engineers have to file access requests instead of querying directly — but it transforms the blast radius of an insider incident or a compromised dev account.
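The enforcement half can be automated. A sketch of a nightly drift check that diffs actual read grants against an approved manifest; the permissions client and manifest format are both assumptions.

```python
# Nightly drift check: who can actually read the vector index, versus
# who is approved to. The permissions client and manifest are hypothetical.
import yaml  # pip install pyyaml

def audit_read_access(perm_client, alert, manifest_path: str = "vector-acl.yaml") -> None:
    with open(manifest_path) as f:
        approved = set(yaml.safe_load(f)["read_principals"])

    actual = set(perm_client.list_read_principals(index="prod-user-content"))

    unapproved = actual - approved
    if unapproved:
        # Treat with the same severity as an unapproved prod-database grant.
        alert.page(
            title="Unapproved read access to production vector index",
            principals=sorted(unapproved),
        )
```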
The Org Failure Mode
The pattern I have seen repeat: a RAG launch goes through the standard governance gauntlet. The security review concludes that embeddings are not PII because "they are not human-readable and not directly identifying." The privacy review concludes that the deletion pipeline is compliant because "the source text is removed." Each review is sound on its own terms; neither team is doing a poor job. What is missing is a step that asks: if the source text is recoverable from the embedding, does that change the conclusion of the other review?
The answer, given the inversion literature, is yes. But the organizational shape of most companies — security and privacy as separate functions, each rubber-stamping their slice — does not surface that question by default. Adding it is a one-line policy change: any ML feature that stores embeddings of user content requires a joint review where both teams sign off on the combined threat model, not the marginal pieces.
The mature version of this is to make derivative-PII scope explicit in the privacy notice: tell users that their data is stored as embeddings, that those embeddings have a deletion pipeline, and what the retention window is. That sentence is shorter than this paragraph and is the difference between "we comply" and "we comply legibly."
Closing
The new privacy boundary in AI systems is not "what did the model say." That question is well-understood, well-instrumented, and the subject of every prompt-injection and content-safety conversation in the industry. The boundary that the average team has not yet crossed is "what did the embedding remember." The inversion-attack literature crossed it three years ago. The regulators are crossing it now — most explicitly in EU enforcement guidance treating derivative representations as in-scope for Article 17. The vendors are not yet shipping the deletion guarantees that close the gap, which means the responsibility falls on you, the engineer, to model the vector store as a copy of the data and to extend every existing data-governance discipline to that copy.
The architectural realization is simple, even if the engineering work isn't: treat embeddings of user content as derivative PII, treat the vector index as a tier-1 production data store, and put the third copy into the same governance picture as the first two. Then go check who has read access to your production vector store. The number is almost always too high.
- https://arxiv.org/pdf/2310.06816
- https://arxiv.org/html/2406.10280v1
- https://arxiv.org/html/2504.00147v1
- https://aclanthology.org/2024.acl-long.422.pdf
- https://aclanthology.org/2025.acl-long.1185.pdf
- https://genai.owasp.org/llmrisk/llm082025-vector-and-embedding-weaknesses/
- https://ironcorelabs.com/blog/2024/text-embedding-privacy-risks/
- https://github.com/OWASP/AISVS/blob/main/1.0/en/0x10-C08-Memory-Embeddings-and-Vector-Database.md
- https://aws.amazon.com/blogs/machine-learning/implementing-knowledge-bases-for-amazon-bedrock-in-support-of-gdpr-right-to-be-forgotten-requests/
- https://milvus.io/ai-quick-reference/how-do-vector-dbs-comply-with-legal-data-privacy-regulations-eg-gdpr
