
The Vector Dimension Tax: How Embedding Size Quietly Drains Your Budget

· 8 min read
Tian Pan
Software Engineer

Most teams building RAG systems spend zero time thinking about embedding dimensions. They grab text-embedding-3-large, leave the dimensions at the default 3072, and move on. At 10,000 documents that's fine. At 10 million, you've handed your cloud provider a $30/month storage bill that should have been $3.75. At 100 million documents, you're staring at a terabyte of float32 values that mostly aren't earning their keep.

The relationship between embedding dimensions and actual retrieval quality is far weaker than the relationship between dimensions and operational cost. That gap — between the cost you're paying and the quality you're getting — is the vector dimension tax.

Why Dimensionality Costs More Than You Think

Each embedding dimension is a float32 value: 4 bytes. The math is boring but the totals aren't (a quick calculation follows the list):

  • 1 million vectors at 768 dimensions = ~3 GB
  • 1 million vectors at 3072 dimensions = ~12 GB
  • 100 million vectors at 3072 dimensions = ~1.2 TB
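
These figures fall straight out of the arithmetic. A minimal sketch, counting raw float32 storage only; index overhead, replicas, and metadata add more on top:

```python
def raw_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage, before index overhead or replication."""
    return num_vectors * dims * bytes_per_value / 1e9

for num_vectors, dims in [(1_000_000, 768), (1_000_000, 3072), (100_000_000, 3072)]:
    print(f"{num_vectors:>11,} vectors x {dims} dims = {raw_storage_gb(num_vectors, dims):,.1f} GB")
# ~3.1 GB, ~12.3 GB, ~1,228.8 GB
```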

That's just storage. The cost compounds through the entire retrieval stack. Vector similarity search scales with dimension — computing cosine similarity between 1536-dimensional vectors is roughly 4x slower than between 384-dimensional ones, all else equal. HNSW indexes (the standard for approximate nearest neighbor search) face worsening performance as dimensions climb, because the "curse of dimensionality" causes all vectors to become nearly equidistant from a query point, blunting the graph's directional guidance. Batch encoding throughput drops as the model produces larger output tensors per sample.
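
The linear scaling of the similarity step is easy to check directly. A rough sketch with NumPy; the exact ratio depends on your BLAS build and hardware, so treat the 4x figure as approximate:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
query_count, corpus_count = 100, 100_000

for dims in (384, 1536):
    queries = rng.standard_normal((query_count, dims), dtype=np.float32)
    corpus = rng.standard_normal((corpus_count, dims), dtype=np.float32)
    # Normalize so the dot product below is cosine similarity.
    queries /= np.linalg.norm(queries, axis=1, keepdims=True)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

    start = time.perf_counter()
    scores = queries @ corpus.T  # brute-force cosine similarity
    print(f"{dims} dims: {time.perf_counter() - start:.3f}s")
```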

None of these effects are catastrophic in isolation. Together, they create a compounding cost structure most teams don't notice until they're scaling aggressively.

The Quality Side of the Tradeoff

Here's the uncomfortable truth about high-dimensional embeddings: for most production workloads, they don't move the needle on retrieval quality.

Look at what actually happens on the MTEB benchmark as you reduce dimensions on a Matryoshka-capable model like nomic-embed-text-v1.5:

  • 768 dimensions: 62.28 average score
  • 512 dimensions: 61.96 average score
  • 256 dimensions: 61.04 average score

A 66% reduction in dimension (768 → 256) costs about 1.2 points of average MTEB score. OpenAI published a similar finding for text-embedding-3-large: truncated to 256 dimensions, it still outperforms the full 1536-dimensional text-embedding-ada-002 on MTEB. The information in the early dimensions is denser.

The domains where high dimensions genuinely help are narrow:

  • Complex technical or scientific documents with fine-grained distinctions
  • Multilingual content where cross-language alignment matters
  • Long-tail classification problems with many similar categories

If your product is a support chatbot, a document search tool, or a code assistant, you're almost certainly not in that set. You're paying for dimensions that are mostly noise relative to your query distribution.

Where the Default-Is-Fine Assumption Breaks Down

The "just use defaults" heuristic is fine at development scale. The problems emerge at three specific inflection points.

When your document count crosses a million. Storage costs become real. At 10M documents, choosing 3072 dimensions over 384 is an 8x increase in storage cost. Managed vector database pricing often runs 1.5–3x more expensive than self-hosted alternatives at this scale, which means dimension choice and hosting choice compound.

When your query latency budget gets tight. HNSW performs well up to about 768–1024 dimensions for typical sentence embedding models. Above 1024, the algorithm's greedy search becomes less effective as the geometry of high-dimensional spaces makes directional progress harder to achieve. You'll see this as degraded recall at your target ef setting, not as a clean latency regression.

When you need to switch models. Embedding models produce vectors in incompatible spaces. Moving from text-embedding-ada-002 to text-embedding-3-large means re-encoding your entire document corpus. At 100M documents with a 30ms encoding budget per batch, that's a multi-day job. Teams that haven't planned for this discover it when a model gets deprecated.

Matryoshka Embeddings: The Option You're Probably Not Using

The practical solution to the dimension tradeoff — for teams already using recent embedding models — is Matryoshka Representation Learning (MRL). The technique, published at NeurIPS 2022, trains a single embedding model against loss terms at several nested dimension cutoffs at once: for example 256, 512, 1024, and so on up to 3072 for a 3072-dimensional model.

The result is that early dimensions capture the most semantically important information, and later dimensions add incremental detail. You can truncate the vector at inference time without re-encoding anything, and the truncated vector is a valid, high-quality embedding.
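
In code, truncation is a slice followed by renormalization. A minimal sketch with NumPy; truncate_embedding is an illustrative helper, not part of any model's SDK:

```python
import numpy as np

def truncate_embedding(vec, target_dims):
    """Keep the first target_dims values and renormalize to unit length,
    so cosine similarity still behaves as expected after truncation."""
    truncated = np.asarray(vec, dtype=np.float32)[:target_dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.rand(3072).astype(np.float32)  # stand-in for a real MRL embedding
short = truncate_embedding(full, 512)           # valid 512-dim embedding of the same text
```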

Several production-grade models now support this:

  • OpenAI text-embedding-3-small (1536 dims, reducible to 256–1536 via the dimensions API parameter)
  • OpenAI text-embedding-3-large (3072 dims, reducible to 256–3072)
  • Nomic nomic-embed-text-v1.5 (768 dims, reducible to 64–768)
  • Cohere embed-v4.0 (supports 256, 512, 1024, 1536)
  • Google Gemini Embedding 001 (3072 dims, reducible to 768+)

Most teams using these models haven't touched the dimensions parameter. The API supports it explicitly — for OpenAI, it's a single extra field in the embedding request. Moving from 3072 to 1024 dimensions for text-embedding-3-large cuts storage by 67% and typically costs less than 1–2% in retrieval quality on general-purpose benchmarks.
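
For the OpenAI models, that extra field looks like this. A minimal sketch with the official Python SDK; the model and dimension value are examples, not a recommendation for your workload:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I rotate my API keys?",
    dimensions=1024,  # truncated server-side; omitting it returns the full 3072
)
vector = response.data[0].embedding
print(len(vector))  # 1024
```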

A Framework for Choosing Dimensions

Rather than optimizing by intuition, treat dimension choice as a decision with three inputs: document count, query latency budget, and retrieval quality target.

Under 1 million documents: Dimension choice doesn't meaningfully affect operational cost. Use whatever the model defaults to. Spend your time on chunking strategy and retrieval evaluation instead.

1–10 million documents: This is where the tradeoff starts mattering. Target 512–768 dimensions. If your model supports MRL, test 512 against 768 on a held-out sample of your actual queries. The quality difference is usually negligible; the storage difference is 33%.

10–100 million documents: Storage costs become a significant budget line. Target 256–512 dimensions with an MRL-capable model. OpenAI's benchmark showing 256-dimensional text-embedding-3-large outperforming full-dimensional text-embedding-ada-002 is directly relevant here — you don't need to sacrifice quality to reduce dimension if you're on a modern model.

Above 100 million documents: At this scale, self-hosted vector databases (Milvus, Qdrant, Weaviate) typically win on cost compared to managed services. Dimension reduction and quantization become necessary, not optional. Product Quantization (PQ) can compress vector storage by 64x through a separate mechanism from dimension reduction — the two approaches compound. An embedding at 512 dimensions with PQ applied can achieve the storage footprint of a 16-dimensional float32 vector while maintaining reasonable recall.
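
Product Quantization is available off the shelf in libraries like FAISS. A rough sketch with illustrative parameters: 64 sub-quantizers at 8 bits each store a 512-dimensional vector in 64 bytes, the footprint of 16 float32 values.

```python
import faiss
import numpy as np

dims = 512
vectors = np.random.rand(100_000, dims).astype(np.float32)

# 64 sub-quantizers x 8 bits = 64 bytes per vector,
# versus 512 dims x 4 bytes = 2048 bytes for raw float32.
index = faiss.IndexPQ(dims, 64, 8)
index.train(vectors)   # PQ learns its codebooks from a training sample
index.add(vectors)

distances, ids = index.search(vectors[:5], 10)  # top-10 approximate neighbors
```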

The Measurement Problem

The deeper issue isn't that teams pick wrong dimensions — it's that most teams don't have a retrieval quality measurement framework in place when they're making dimension decisions. They run a few ad-hoc queries, eyeball the results, and ship.

Without a labeled evaluation set (query + expected document pairs from your actual workload), you can't quantify the quality tradeoff at all. You're choosing dimensions on faith. The reasonable approach, sketched in code after the list, is:

  1. Build a small eval set (100–500 query/document pairs) from real production queries or domain-expert labeling.
  2. Benchmark your candidate embedding model at multiple dimension settings against that eval set.
  3. Pick the smallest dimension that clears your quality threshold.
  4. Revisit when your document distribution shifts significantly (major product expansion, new content types).
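
A minimal sketch of steps 2 and 3, assuming a hypothetical embed(texts, dims) function for your candidate model and unit-normalized vectors; recall@k stands in for whatever metric your quality threshold is defined on:

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, expected_ids, doc_ids, k=5):
    """Fraction of queries whose expected document shows up in the top-k results."""
    scores = query_vecs @ doc_vecs.T              # cosine similarity for unit vectors
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = sum(expected in [doc_ids[j] for j in row]
               for row, expected in zip(top_k, expected_ids))
    return hits / len(expected_ids)

# for dims in (256, 512, 768):
#     q = embed(eval_queries, dims)   # your model's encode call, truncated to dims
#     d = embed(corpus_texts, dims)
#     print(dims, recall_at_k(q, d, expected_ids, doc_ids))
```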

This takes a week to set up correctly. The cost savings at 10M+ documents will recover that time within the first month of production operation.

Dimension Choice Is an Engineering Decision, Not a Default

The pattern across teams that have built at serious scale — Shopify processing 216 million embeddings per day for product catalog search, database providers capping dimensions at 1998 to work within SQL storage constraints — is that dimension is treated as a first-class engineering parameter, not an afterthought.

The default 3072 dimensions on a large embedding model is the vendor's maximum, optimized for quality on their benchmark suite. It is not optimized for your storage budget, your query latency SLA, or your specific document distribution. For almost every production use case, a smaller dimension with a Matryoshka-capable model will hit the same quality bar at a fraction of the infrastructure cost.

The vector dimension tax only collects if you let it. The question is whether you measure before scaling or after.


The practical starting point: if you're using text-embedding-3-large, pass dimensions=1024 in your embedding API call. Run your retrieval eval before and after. The results will tell you whether you're paying 3x storage cost for a measurable quality benefit.
