$5.6M vs $100M: How DeepSeek V3.2 Achieves 95% Cost Reduction in AI Training

Let me break down why this matters, what it means for the industry, and how DeepSeek achieved this breakthrough.

The Training Cost Landscape

First, let’s establish the baseline. Training costs for frontier AI models have been escalating dramatically:

Historical Training Costs (Estimated)

  • GPT-3 (175B, 2020): ~$4-5 million
  • GPT-4 (rumored ~1.8T MoE, 2023): $50-100 million
  • Claude 3 Opus (2024): $50-80 million (estimated)
  • Gemini Ultra (2024): $80-120 million (estimated)
  • Llama 3.1 405B (2024): ~$8-12 million

The trend was clear: to reach GPT-4-class performance, you needed to spend $50M+. This created a massive barrier to entry and concentrated AI development in a handful of well-funded organizations.

DeepSeek V3.2 breaks this trend entirely: $5.6 million for GPT-4-competitive performance is one-tenth to one-twentieth the expected cost.

Breaking Down the $5.6M Cost

Let’s analyze what went into this number:

GPU Compute Costs

DeepSeek reports 2.788 million H800 GPU hours for training. Let’s calculate:

Hardware: H800 GPUs (export-restricted variant of H100)

  • Cloud cost: ~$3-4/hour per H800 GPU (China cloud providers)
  • Total compute cost: 2,788,000 hours × $3.50/hour = $9.76 million

Wait – that’s already higher than the reported $5.6M total cost. What’s going on?

The key is that DeepSeek likely owns their GPUs rather than renting cloud compute:

Capital Expenditure Model:

  • H800 GPU purchase price: ~$25,000 per GPU (vs H100 at $30-40K)
  • For 2.788M GPU hours on a 2,048 GPU cluster over ~2 months:
    • 2,788,000 hours ÷ 2,048 GPUs = 1,361 hours per GPU
    • ~57 days of continuous training

Depreciation Cost:

  • 2,048 × $25,000 = $51.2M capital investment
  • Assuming 3-year depreciation: $51.2M ÷ 36 months = $1.42M/month
  • Training cost: $1.42M × 2 months = $2.84M in GPU depreciation

Operating Costs:

  • Power: ~700W per H800 × 2,048 GPUs = 1.43 MW
  • Over 57 days: 1,960 MWh
  • At China industrial rates (~$0.08/kWh): $157K
  • Cooling (typically 0.5x power): $79K
  • Total power + cooling: $236K

Networking and Storage:

  • High-speed interconnect (InfiniBand): ~$200K for 2K GPU cluster
  • Distributed storage: ~$100K
  • Total: $300K

Personnel:

  • 20 researchers/engineers × 2 months × $15K/month = $600K
  • (Note: Chinese AI researcher salaries are 30-50% of US equivalents)

Total Estimated Cost:

  • GPU depreciation: $2.84M
  • Power + cooling: $236K
  • Networking + storage: $300K
  • Personnel: $600K
  • Misc (data, tools, overhead): $200K
  • Grand total: ~$4.2M

This is in the ballpark of the reported $5.6M, with the difference likely in:

  1. Data acquisition and processing costs
  2. Failed training runs (not every experiment succeeds)
  3. Infrastructure setup and tooling
  4. Conservative vs aggressive depreciation assumptions
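The estimate above is easy to sanity-check in code. The sketch below just reruns this section's arithmetic; every input (GPU price, depreciation schedule, power rate, and so on) is an assumption from the text, not a disclosed figure:

```python
# Back-of-envelope cost model for the DeepSeek-style training run.
# All inputs are this section's estimates, not reported numbers.

def training_cost_estimate(
    num_gpus=2048,
    gpu_price=25_000,        # assumed H800 purchase price
    depreciation_months=36,  # straight-line over 3 years
    training_months=2,
    it_power_kw=1_434,       # 700 W x 2,048 GPUs
    days=57,
    power_rate=0.08,         # $/kWh, assumed China industrial rate
    cooling_factor=0.5,      # cooling draw as a fraction of IT power
    network_storage=300_000,
    personnel=600_000,
    misc=200_000,
):
    depreciation = num_gpus * gpu_price / depreciation_months * training_months
    power_cost = it_power_kw * 24 * days * power_rate
    cooling_cost = power_cost * cooling_factor
    return depreciation + power_cost + cooling_cost + network_storage + personnel + misc

print(f"${training_cost_estimate() / 1e6:.1f}M")  # ~$4.2M, matching the total above
```

Changing any single assumption (say, a 4-year depreciation schedule) moves the total by hundreds of thousands of dollars, which is why the reported $5.6M and this $4.2M estimate can both be "right".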

How DeepSeek Achieved This Cost Efficiency

The $5.6M cost isn’t magic – it’s the result of several architectural and engineering decisions:

1. Mixture-of-Experts Architecture (Biggest Cost Saver)

With 671B total parameters but only 37B active per token, DeepSeek dramatically reduces the computational cost per training step.

Cost Impact Analysis:

A hypothetical 671B dense model would require:

  • Forward pass: 671B parameter computations per token
  • Backward pass: ~2× forward pass cost
  • Total: ~2 trillion FLOPs per token (rough estimate)

DeepSeek’s MoE with 37B active parameters:

  • Forward pass: 37B parameter computations per token (activated experts only)
  • Routing overhead: ~1-2B parameter computations (selecting experts)
  • Backward pass: More complex, but still proportional to active parameters
  • Total: ~120-150 billion FLOPs per token

Efficiency gain: ~15x reduction in FLOPs per training token

This means DeepSeek could train on 15x more tokens for the same compute budget, or achieve the same effective training for 1/15th the cost.
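To make the comparison concrete, here is the same FLOPs accounting as a tiny script. The `overhead` term (routing plus shared computation) is an assumption chosen to land in the 120-150B range estimated above:

```python
# Rough FLOPs-per-token comparison, using the same counting convention
# as the text: forward cost ~ active parameters, backward ~ 2x forward.

def train_flops_per_token(active_params, overhead=0.0):
    forward = active_params + overhead
    return forward + 2 * forward   # forward pass + backward pass

dense = train_flops_per_token(671e9)              # ~2.0e12, as in the text
moe = train_flops_per_token(37e9, overhead=8e9)   # ~1.35e11, assumed overhead
print(f"~{dense / moe:.0f}x fewer FLOPs per token")  # ~15x
```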

2. FP8 Mixed Precision Training

Training in FP8 instead of BF16/FP16 provides multiple cost savings:

Memory Bandwidth Savings:

  • BF16: 16 bits per parameter = 2 bytes
  • FP8: 8 bits per parameter = 1 byte
  • 50% reduction in memory traffic

This matters enormously because GPU training is often memory-bandwidth bound, not compute-bound. Halving memory bandwidth requirements effectively doubles training throughput on the same hardware.

Compute Throughput:

  • H800 GPUs have specialized FP8 tensor cores
  • FP8 operations: ~2000 TFLOPS (teraFLOPS)
  • BF16 operations: ~1000 TFLOPS
  • 2x compute throughput for FP8

Combined, FP8 training provides roughly 3-4x training efficiency compared to BF16.
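A minimal sketch of where the 3-4x figure comes from, using the assumed H800 peak rates quoted above. The two 2x gains apply to different bottlenecks (bandwidth vs compute), so the realized speedup sits below their product:

```python
# FP8 vs BF16: memory traffic halves and tensor-core throughput doubles.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}
PEAK_TFLOPS = {"bf16": 1000, "fp8": 2000}  # assumed H800 peak rates

bandwidth_gain = BYTES_PER_PARAM["bf16"] / BYTES_PER_PARAM["fp8"]  # 2.0
compute_gain = PEAK_TFLOPS["fp8"] / PEAK_TFLOPS["bf16"]            # 2.0

# Each training step is bound by one resource at a time, so the realized
# gain lands between 2x and the 2 x 2 = 4x upper bound.
print(bandwidth_gain * compute_gain)  # 4.0
```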

3. Efficient Data Pipeline

Training efficiency isn’t just about model architecture – data loading and preprocessing can be major bottlenecks.

DeepSeek likely optimized:

  • Data loading: Parallel data loading to keep GPUs saturated
  • Preprocessing: Move preprocessing to CPU to free up GPU cycles
  • Caching: Cache preprocessed data to avoid redundant computation
  • Compression: Compress data in transit to reduce network bottleneck

These optimizations can improve effective GPU utilization from 60-70% (typical) to 85-95% (excellent), a 20-40% efficiency gain.

4. H800 GPU Efficiency

While H800 is the export-restricted variant of H100 (lower interconnect bandwidth), for MoE training this matters less than you’d think:

  • MoE models: Different experts can run on different GPUs with less inter-GPU communication than dense models
  • Gradient accumulation: Can use larger local batch sizes to amortize communication cost
  • Smart sharding: Place experts on GPUs to minimize cross-GPU routing

DeepSeek’s architecture seems optimized for H800’s limitations, extracting near-H100 performance for their specific workload.

5. Training Recipe Optimization

DeepSeek’s training recipe likely includes:

  • Curriculum learning: Start with easier data, gradually increase difficulty (faster convergence)
  • Multi-Token Prediction: Get more training signal per forward pass
  • Optimal batch size: Carefully tuned batch size for best convergence vs throughput tradeoff
  • Learning rate schedule: Aggressive early learning rate for fast initial progress

These recipe optimizations can reduce total training steps by 30-50% compared to naive training.

Compounding Efficiency Gains

The key insight is that these efficiency factors multiply:

  • MoE architecture: 15x
  • FP8 precision: 3x
  • Data pipeline: 1.3x
  • H800 optimization: 1.2x
  • Training recipe: 1.5x

Total efficiency: 15 × 3 × 1.3 × 1.2 × 1.5 ≈ 100x

Of course, these aren’t all independent – some gains overlap. But even accounting for overlap, DeepSeek likely achieved 30-50x better cost efficiency than a naive approach.

This explains how they spent $5.6M to achieve what might have cost $100-200M with standard techniques.

Comparison to GPT-4 Training Economics

Let’s estimate GPT-4’s training costs (OpenAI hasn’t disclosed, but we can infer):

GPT-4 (Estimated):

  • Model size: ~1.8T parameters (MoE with ~300B active, rumored)
  • Training compute: ~10-15× GPT-3 (based on performance gap)
  • GPT-3 used ~3,640 petaflop/s-days
  • GPT-4: ~40,000-50,000 petaflop/s-days
  • At cloud GPU rates: $50-100M

Why did GPT-4 cost so much more?

  1. Larger scale: 1.8T total parameters vs 671B (2.7x more)
  2. More active parameters: ~300B active vs 37B (8x more per token)
  3. Higher precision: BF16 vs FP8 (2-3x more bandwidth/compute)
  4. Less optimized architecture: Early MoE design vs DeepSeek’s refined approach
  5. US costs: 3-4x higher cloud/power/personnel costs than China
  6. Broader R&D: GPT-4 training cost includes many failed experiments

Accounting for the quantified factors (1-5): 2.7 × 8 × 2.5 × 1.5 × 3.5 ≈ 284x cost ratio

The actual ratio is only ~18x ($100M vs $5.6M), which suggests:

  • DeepSeek’s efficiency gains are real but not nearly as dramatic as 284x
  • GPT-4’s cost might be on the lower end of estimates (~$30-40M)
  • DeepSeek’s cost might include only the successful training run, not R&D overhead

Still, even an ~18x cost advantage is revolutionary.

ROI Analysis for Different Organization Types

Let’s analyze what this cost structure means for different types of organizations:

Large Tech Companies (Google, Meta, Microsoft)

Before DeepSeek: Training cost $50-100M is manageable but significant

  • Requires executive approval, careful budgeting
  • Limits experimentation (can’t afford many failed runs)
  • 1-2 major model training runs per year

After DeepSeek: Training cost $5-10M is “rounding error” in R&D budget

  • Can approve multiple independent training runs
  • Enables rapid experimentation with architectures
  • Could train dozens of specialized models per year

Impact: Accelerates innovation, enables domain-specific model proliferation

Well-Funded AI Startups (Anthropic, Cohere, Inflection)

Before DeepSeek: $50-100M training cost requires dedicated fundraising

  • Need $100M+ Series B/C to afford frontier model training
  • Training budget is major capital allocation decision
  • High risk if model underperforms

After DeepSeek: $5-10M training cost fits within typical Series A budget

  • Can train frontier model on $25-50M total funding
  • Enables smaller AI startups to compete
  • Lower risk, can afford multiple attempts

Impact: Democratizes frontier AI development, increases competition

Academic Labs & Research Institutions

Before DeepSeek: $50-100M training cost is completely out of reach

  • Would consume entire annual budget of large CS departments
  • Frontier AI research limited to industry
  • Academia relegated to smaller models or fine-tuning

After DeepSeek: $5-10M training cost is attainable with major grants

  • NSF/NIH/etc. could fund frontier model training
  • Top universities could pool resources for shared models
  • Enables academic participation in frontier research

Impact: Returns frontier AI research to academia, faster fundamental progress

Independent Researchers & Small Labs

Before DeepSeek: Frontier model training completely impossible

  • Would need VC funding just to participate
  • Limited to analyzing others’ models
  • Innovation concentrated in large orgs

After DeepSeek: Still expensive, but conceivable with crowdfunding/patronage

  • Could raise $5M through crypto, crowdfunding, or angel investors
  • Enables “indie AI labs” to exist
  • One dedicated researcher could train frontier model

Impact: Enables long-tail innovation, unconventional approaches

Implications for AI Industry Economics

The $5.6M training cost has profound implications for AI business models:

1. Proprietary Model Moats Eroding

OpenAI and Anthropic’s competitive advantage has been:

  • Massive capital to train expensive models
  • Technical expertise to do it successfully
  • First-mover advantage on frontier capabilities

If training costs drop 10-20x, this moat shrinks dramatically:

  • Many more orgs can afford frontier training
  • Technical expertise will diffuse (via open-source releases like DeepSeek)
  • First-mover advantage compressed (others catch up faster)

Business model impact: API pricing power decreases, margins compress

2. Open Source Accelerates

At $50-100M training cost, open-sourcing a frontier model means:

  • Giving away $50-100M of R&D investment
  • Difficult to justify to investors/shareholders
  • Only philanthropic orgs (Meta) or strategic players (Google) can afford it

At $5-10M training cost, open-sourcing becomes more viable:

  • Smaller sunk cost to give away
  • PR/recruiting/ecosystem benefits may justify cost
  • More orgs can afford to open-source

Impact: Expect more open frontier models in 2026-2027

3. Specialized Models Proliferate

At $50-100M per model, organizations train general-purpose models for maximum ROI:

  • Can’t afford separate models for code, science, medical, etc.
  • One model must serve all use cases
  • Fine-tuning on top of general model is the standard approach

At $5-10M per model, specialized models become economical:

  • Train separate models optimized for specific domains
  • Code model with 90% code data, science model with 80% papers, etc.
  • Higher performance for domain-specific tasks

Impact: Model diversity increases, better domain performance

4. Vertical Integration Incentives

Currently, most companies use API access to models (OpenAI, Anthropic):

  • Training is too expensive to justify
  • Even fine-tuning is complex/costly
  • Easier to pay API fees

At $5-10M training cost, mid-large companies may train custom models:

  • $5M one-time cost vs $5M/year in API fees (if high usage)
  • Full control over model behavior and data privacy
  • Customization for specific business needs

Impact: Disintermediation of API providers, shift to self-hosting

The Democratization Thesis

The broader implication is democratization of frontier AI:

Old regime (2020-2024):

  • Frontier AI: Limited to 5-10 organizations globally (OpenAI, Google, Anthropic, Meta, DeepSeek, etc.)
  • Barrier: $50-100M training cost + rare expertise
  • Result: Concentrated power, API gatekeepers, high prices

New regime (2025+):

  • Frontier AI: Accessible to 50-100 organizations globally
  • Barrier: $5-10M training cost + increasingly common expertise
  • Result: Distributed innovation, open-source competition, low prices

This is analogous to other technology democratizations:

  • Supercomputers (1990s): $10M+ → Cloud computing (2010s): $1K/month
  • Genome sequencing (2001): $100M → Today: $1K
  • Satellite launch (1990s): $100M+ for a dedicated rocket → SpaceX rideshare (2020s): ~$1M for a small payload

The pattern: 10-100x cost reduction enables orders of magnitude more participants.

Caveats and Concerns

Let me inject some realism:

1. DeepSeek’s $5.6M May Be Understated

The reported number likely doesn’t include:

  • Prior failed experiments and iterations
  • Infrastructure setup (one-time costs amortized across multiple models)
  • Full personnel costs (may only count direct training team, not supporting engineers)
  • Data acquisition and cleaning (major cost for some organizations)

True total cost: Likely $10-20M when fully accounting for everything

Still, this is 5-10x cheaper than GPT-4, so the conclusion holds.

2. Reproducibility Unknown

Just because DeepSeek trained a model for $5.6M doesn’t mean others can:

  • They may have proprietary optimizations not in the paper
  • Their data pipeline and infrastructure setup took years to develop
  • First attempts by others might cost 2-3x more

Realistic cost for others: Probably $10-15M for first successful attempt

3. Quality-Cost Tradeoffs

DeepSeek made architectural choices (MoE, FP8, sparse attention) that reduce cost but may sacrifice some quality:

  • SimpleQA score suggests factual knowledge gaps
  • Long-context performance unclear
  • Edge case behavior unknown

It’s possible GPT-4’s higher training cost ($50-100M) buys meaningfully better quality through:

  • Higher precision training (BF16 vs FP8)
  • More extensive RLHF (not included in base training cost)
  • Better data curation (expensive but high-impact)

4. China-Specific Advantages

Some of DeepSeek’s cost advantages may be China-specific:

  • Lower GPU prices (H800 at $25K vs H100 at $35K)
  • Cheaper power ($0.08/kWh vs $0.15/kWh in US)
  • Lower personnel costs (1/3 of US salaries)
  • Government support/subsidies (possible but unconfirmed)

US cost to replicate: Might be $10-15M due to higher input costs

Looking Forward: The $1M Model

If DeepSeek achieved $5.6M through architectural innovation, where does this trajectory lead?

Efficiency improvements on the horizon:

  1. INT4 training: 2x better than FP8 (active research area)
  2. More efficient MoE: Fewer experts, better routing (ongoing work)
  3. Improved sparse attention: 90% reduction vs today’s 70%
  4. Better training recipes: Faster convergence, less compute

Optimistically, these could provide another 5-10x efficiency gain by 2026-2027.

Prediction: By 2027, we’ll see GPT-4-competitive models trained for $1 million or less.

At that price point:

  • Hundreds of organizations can train frontier models
  • Universities can afford it with normal research grants
  • Small startups can compete with tech giants
  • True proliferation of specialized, custom models

This is the future DeepSeek V3.2 is pointing toward: frontier AI as a commodity, not a rare capability controlled by a few giants.

Conclusion

The $5.6 million training cost for DeepSeek V3.2 is the most important number in AI this year.

It’s not just about one model being cheaper to train. It’s about proving that frontier AI doesn’t have to cost $50-100M. The combination of architectural innovation (MoE, sparse attention, FP8), engineering excellence (data pipelines, training recipes), and strategic constraints (H800 hardware, cost pressure) produced a breakthrough in cost efficiency.

This changes the economics of AI from a game dominated by the hyper-wealthy (OpenAI with Microsoft backing, Google, Meta) to one where the merely wealthy (well-funded startups, universities, mid-size tech companies) can compete.

The 2020s started with AI as the domain of giants. The 2030s may see frontier AI as a commodity, with hundreds of organizations training custom models for specific applications. DeepSeek V3.2’s $5.6M cost is the first clear sign of this transition.

For researchers, startups, and organizations that felt locked out of frontier AI development: the door is now open. It’s expensive, but no longer impossible. The democratization of AI is accelerating, and DeepSeek just hit the gas pedal.


David Kim, PhD - AI Economics Researcher, UC Berkeley Center for Human-Compatible AI

The 2,048 H800 GPU cluster that trained DeepSeek V3.2 is a masterclass in infrastructure efficiency. Let me break down what makes MoE training different from dense model training and how DeepSeek optimized for it.

Standard Dense Model Training

When training dense models like Llama 3.1 405B, you need:

Tensor Parallelism: Split model layers across multiple GPUs

  • Every forward pass requires all GPUs to communicate
  • High bandwidth interconnect essential (NVLink, InfiniBand)
  • Bottleneck: Inter-GPU communication bandwidth

Data Parallelism: Different batches on different GPU sets

  • Less communication (only gradient synchronization)
  • Easier to scale
  • Bottleneck: Gradient aggregation at each step

For dense 400B+ models, you’re typically looking at:

  • 8-16 GPUs per model replica (tensor parallelism)
  • 100+ replicas for data parallelism
  • All-reduce communication for gradient sync
  • Very sensitive to interconnect bandwidth

MoE Training: Different Bottlenecks

DeepSeek’s 671B parameter MoE with 256 experts changes the bottleneck profile:

Expert Parallelism: Each expert lives on specific GPUs

  • Different experts activated for different tokens
  • Less all-to-all communication than dense models
  • Can tolerate slightly lower interconnect bandwidth

Dynamic Routing: Tokens routed to different expert GPUs based on router decision

  • Need fast routing logic (low latency)
  • Can batch tokens going to same expert
  • Load balancing critical for efficiency

The beautiful thing about MoE for H800 GPUs:

  • H800 has reduced inter-GPU bandwidth vs H100 (due to export restrictions)
  • But MoE training is less bandwidth-sensitive than dense training
  • DeepSeek’s architecture naturally fits H800’s limitation

DeepSeek’s Likely Cluster Configuration

Based on the 2,788,000 GPU hours and ~2 month training timeline, here’s my best guess at their setup:

Cluster Layout:

  • 2,048 H800 GPUs total
  • Organized as 256 nodes × 8 GPUs per node
  • Each node handles ~3-5 experts (256 experts ÷ 64-85 nodes)
  • Remaining nodes for data parallelism and pipeline parallelism

Network Topology:

  • NVLink within each 8-GPU node (high bandwidth, low latency)
  • InfiniBand between nodes (100-200 Gbps per node)
  • Optimized routing to minimize cross-node expert access

Storage:

  • 10+ PB distributed storage for training data
  • NVMe local storage on each node for caching (1-2 TB per node)
  • Parallel data loading to prevent I/O bottleneck

Achieving 85-95% GPU Utilization

David mentioned this, but let me explain the engineering details. GPU utilization is critical because:

At 60% utilization: $5.6M effective cost becomes $9.3M (40% waste)
At 90% utilization: $5.6M effective cost becomes $6.2M (10% waste)

DeepSeek likely used these techniques:

1. Gradient Accumulation with MicroBatching

Instead of:

Forward → Backward → Update Weights

They do:

Forward (micro-batch 1) → Backward
Forward (micro-batch 2) → Backward
...
Forward (micro-batch N) → Backward
Update Weights (accumulated gradients)

Benefit: Can use smaller micro-batches that fit in GPU memory, accumulate to large effective batch size

  • Keeps GPUs busy with computation
  • Reduces communication frequency
  • Better throughput without quality degradation
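A minimal, framework-free sketch of the accumulation loop above, on a toy linear model (purely illustrative, not DeepSeek's training code): micro-batch gradients are averaged, and the weight update fires once per accumulation window.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0])   # target weights for y = w . x
w = np.zeros(4)
lr, accum_steps = 0.1, 4
grad_accum = np.zeros_like(w)

for step in range(80):                      # 80 micro-batches -> 20 weight updates
    x = rng.normal(size=4)
    y = true_w @ x
    grad = 2 * (w @ x - y) * x              # squared-loss gradient, one micro-batch
    grad_accum += grad / accum_steps        # average over the accumulation window
    if (step + 1) % accum_steps == 0:
        w -= lr * grad_accum                # single "large batch" weight update
        grad_accum[:] = 0.0

print(np.round(w, 2))                       # approaches true_w
```

The effective batch size is `accum_steps` times the micro-batch size, but peak memory only ever holds one micro-batch, which is exactly the tradeoff that keeps GPUs busy.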

2. Asynchronous Expert Loading

MoE models don’t need all experts in GPU memory simultaneously:

  • Load frequently-used experts permanently
  • Stream less-frequent experts from CPU memory
  • Overlap expert loading with computation

This is risky (can hurt utilization if done wrong) but powerful if:

  • Expert usage is predictable
  • Loading latency is hidden behind computation
  • Most tokens use a core set of experts

3. Pipeline Parallelism for Layer Processing

Break the 80-100 layer model into stages:

  • Stage 1: Layers 1-20 on GPU group 1
  • Stage 2: Layers 21-40 on GPU group 2
  • Stage 3: Layers 41-60 on GPU group 3
  • Stage 4: Layers 61-80 on GPU group 4

Process multiple mini-batches in flight:

  • While Stage 1 processes batch N, Stage 2 processes batch N-1
  • Creates pipeline with multiple batches in different stages
  • Reduces bubble time (idle GPUs waiting for previous stage)

GPipe and PipeDream are academic implementations of this. DeepSeek likely uses a custom variant optimized for MoE.
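The bubble overhead has a simple closed form for a GPipe-style schedule: with S stages and M micro-batches in flight, each stage idles for a fraction (S - 1) / (M + S - 1) of the step. A quick check (the S and M values here are illustrative, not DeepSeek's actual configuration):

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of stage time in a GPipe-style pipeline schedule."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(4, 1))              # 0.75 -- one batch, stages mostly idle
print(round(bubble_fraction(4, 32), 3))   # 0.086 -- micro-batches amortize the bubble
```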

4. Overlapping Communication and Computation

Key trick: while GPUs compute forward/backward pass, simultaneously:

  • Transfer gradients to parameter servers
  • Prefetch next batch of data
  • Load next expert weights (if needed)

This requires careful scheduling but can hide 70-90% of communication latency behind computation.

H800 vs H100: Making Lemonade from Lemons

The H800 is the export-restricted version of H100 with:

  • Same compute performance (FP8, FP16, FP32)
  • Reduced interconnect bandwidth (NVLink limited)
  • Same memory capacity and bandwidth

For dense models, this is painful. For DeepSeek’s MoE, they turned it into an advantage:

Optimization 1: Expert Placement Minimizes Communication

Place experts that are frequently co-activated on the same node:

  • Analyze expert co-activation patterns from initial training runs
  • Cluster experts by similarity
  • Place clustered experts on same 8-GPU node
  • 80% of expert-to-expert communication stays within node (NVLink)
  • Only 20% crosses nodes (InfiniBand)

This is possible because of DeepSeek’s auxiliary-loss-free load balancing – expert usage is more predictable without auxiliary loss dynamics.

Optimization 2: Exploit FP8 Tensor Cores Fully

H800 has the same FP8 tensor core performance as H100:

  • 2000 TFLOPS FP8
  • 1000 TFLOPS BF16

The interconnect limitation matters less when you’re compute-bound, not memory-bound. FP8 training shifts the bottleneck from memory/communication to computation, where H800 equals H100.

Effective cost: H800 at $25K performs like H100 at $35K for DeepSeek’s workload – roughly 30% hardware cost savings for the same effective compute

Optimization 3: Strategic Batch Size Tuning

Larger batch sizes require less frequent communication:

  • Small batch (256): Communicate every 256 tokens
  • Large batch (4096): Communicate every 4096 tokens

DeepSeek likely uses batch sizes of 2048-4096, reducing communication frequency by 8-16x compared to typical training (batch size 256-512).

Tradeoff: Large batches can hurt convergence quality. But with proper learning rate scaling and warmup, this is manageable. The cost savings justify the engineering effort.

Power and Cooling: The Hidden Infrastructure Costs

David calculated $236K for power and cooling, which is reasonable. Let me add details:

Power Consumption

Per GPU:

  • H800 TDP: 700W
  • 2,048 GPUs: 1.434 MW

Supporting infrastructure:

  • CPU, memory, storage: +15% → 1.65 MW
  • Power supply efficiency (90%): ÷ 0.9 → 1.83 MW
  • UPS and distribution losses: +5% → 1.92 MW

Total power draw: ~2 MW continuous

Over 57 days:

  • 2 MW × 24 hours × 57 days = 2,736 MWh
  • At $0.08/kWh (China industrial): $219K

David’s $157K was conservative (didn’t include CPU/memory/storage overhead). Actual power cost is probably $220-250K.
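The chain of overhead factors above, as a calculation (every factor is the assumption just stated; the section rounds the draw up to 2 MW, which is where the $219K figure comes from, so this unrounded version lands slightly lower):

```python
gpus, gpu_tdp_kw = 2048, 0.7
power_kw = gpus * gpu_tdp_kw      # 1,433.6 kW of raw GPU load
power_kw *= 1.15                  # +15% CPU, memory, storage
power_kw /= 0.90                  # 90% power supply efficiency
power_kw *= 1.05                  # +5% UPS and distribution losses

energy_mwh = power_kw * 24 * 57 / 1000
cost = energy_mwh * 1000 * 0.08   # $0.08/kWh, assumed China industrial rate
print(f"{power_kw / 1000:.2f} MW draw, {energy_mwh:.0f} MWh, ~${cost / 1000:.0f}K")
```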

Cooling Systems

For 2 MW of heat dissipation:

Air Cooling (most likely for cost efficiency):

  • CRAC units (Computer Room Air Conditioning)
  • Power Usage Effectiveness (PUE): 1.4-1.5
  • Cooling power: 0.4-0.5x compute power = 800-1000 KW
  • Cost: 800 KW × 1,368 hours × $0.08/kWh ≈ $88K

Liquid Cooling (if used for better efficiency):

  • Direct-to-chip liquid cooling
  • PUE: 1.1-1.2
  • Cooling power: 0.1-0.2x compute power = 200-400 KW
  • Cost: 300 KW × 1,368 hours × $0.08/kWh ≈ $33K
  • But: Higher capex ($2M+ for liquid cooling system)

My guess: DeepSeek used air cooling for capex efficiency, accepting slightly higher opex.

Total power + cooling: $219K + $88K ≈ $307K (vs David’s $236K estimate)

Data Pipeline: The Often-Neglected Bottleneck

Training efficiency isn’t just about GPUs – it’s about keeping them fed with data.

Data Loading Architecture

For 2,048 GPUs processing tokens:

  • Assume 4K token batch per GPU
  • 2,048 × 4K = 8.4M tokens per batch
  • At ~2 bytes per token: 16.8 MB per batch
  • Training throughput: ~10-15 batches per second
  • Data throughput required: ~250 MB/s

This doesn’t sound like much, but:

  • Data is distributed across storage cluster
  • Need to shuffle, tokenize, pack sequences
  • Can’t cache everything (dataset is TBs)
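The throughput requirement is small enough to verify in a couple of lines (the batch shape and step rate are the assumptions above):

```python
gpus, tokens_per_gpu = 2048, 4096   # assumed per-GPU batch of 4K tokens
bytes_per_token = 2                 # compact token IDs
batches_per_sec = 12                # assumed, from the 10-15 range above

batch_mb = gpus * tokens_per_gpu * bytes_per_token / 1e6   # ~16.8 MB
throughput_mb_s = batch_mb * batches_per_sec
print(f"{batch_mb:.1f} MB/batch, {throughput_mb_s:.0f} MB/s required")
```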

DeepSeek’s Likely Data Pipeline

Stage 1: Distributed Storage

  • 10-20 storage nodes with NVMe SSDs
  • Redundant storage (RAID, replication)
  • Parallel reads from multiple nodes

Stage 2: Data Loading Workers

  • Dedicated CPU nodes for data preprocessing
  • 100-200 CPU cores just for data loading
  • Tokenization, filtering, shuffling
  • Pre-pack sequences to target length

Stage 3: Memory Caching

  • Each GPU node has 1-2 TB NVMe cache
  • Cache hot data (frequently accessed sequences)
  • 20-30% cache hit rate typical

Stage 4: Prefetching

  • Load next batch while current batch is processing
  • Overlap data loading with GPU computation
  • Hide I/O latency

Cost of data infrastructure:

  • Storage cluster: $500K (upfront capex, amortized)
  • Data loading CPUs: $200K
  • Networking for data: $100K
  • Total: ~$800K capex, ~$50K opex over 2 months

Monitoring and Reliability: Keeping 2,048 GPUs Running

With 2,048 GPUs running for 57 days straight, you will have failures:

Expected Failure Rate

  • GPU failure rate: ~0.1% per 1,000 hours
  • 2,048 GPUs × 1,361 hours × (0.1% per 1,000 hours) ≈ 2.8 GPU failures expected
  • Node failures, network issues, power blips: add 20%
  • Total expected disruptions: 3-5 over 57 days
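Modeling failures as a Poisson process with the rate assumed above makes the same point more sharply: a failure-free 57-day run is unlikely, so recovery has to be automated.

```python
import math

gpus, hours = 2048, 1361
rate_per_gpu_hour = 0.001 / 1000        # 0.1% per 1,000 device-hours
lam = gpus * hours * rate_per_gpu_hour  # expected failures over the run (~2.8)

p_at_least_one = 1 - math.exp(-lam)     # Poisson: P(no failures) = e^-lambda
print(f"expected failures: {lam:.1f}, P(at least one): {p_at_least_one:.0%}")
```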

Checkpointing Strategy

DeepSeek needs aggressive checkpointing:

  • Frequency: Every 30-60 minutes
  • Checkpoint size: ~1.5 TB for weights alone (671B parameters × 2 bytes); full optimizer state adds several TB more
  • Checkpoint time: 30-60 seconds to write to distributed storage
  • Storage: 10-20 checkpoints retained → 15-30 TB storage

Checkpoint overhead:

  • 60 seconds every 60 minutes = 1.6% time overhead
  • Storage: 30 TB × $0.05/GB/month = $1,500/month
  • Negligible cost, essential for reliability

Automated Recovery

When a GPU fails:

  1. Detect failure (health monitoring)
  2. Mark GPU/node as unhealthy
  3. Reroute workload to spare capacity (5-10% spare nodes)
  4. Continue training without human intervention
  5. Replace failed hardware in next maintenance window

Result: <1% training time lost to failures

Networking Infrastructure: The Connective Tissue

Within-Node: NVLink

Each 8-GPU node uses NVLink:

  • 600 GB/s bidirectional bandwidth per GPU
  • Low latency (<1 microsecond)
  • Perfect for tensor parallelism within node

Cost: Included in GPU purchase, no extra cost

Between-Node: InfiniBand

256 nodes connected via InfiniBand:

  • 200 Gbps per node (likely HDR InfiniBand)
  • Switched fabric (3-4 spine-leaf layers)
  • Aggregate injection bandwidth: ~50 Tbps (256 nodes × 200 Gbps)

Cost:

  • InfiniBand NICs: $2K × 256 = $512K
  • InfiniBand switches: $1M for 256-node fabric
  • Cables: $200K
  • Total: $1.7M capex

Amortized over 3 years of cluster use: ~$95K per 2-month training run ($1.7M ÷ 18 runs)

Storage Network: Separate from Compute

Best practice: separate network for storage traffic

  • 100 Gbps Ethernet for storage
  • Prevents storage I/O from interfering with training communication

Cost: $200K additional (switches, NICs, cables)

Total Infrastructure Cost Breakdown (Revised)

Let me revise David’s estimates with infrastructure details:

Compute (owned, not rented):

  • 2,048 × H800 GPUs: $51.2M capex
  • Amortized over 3 years: $1.42M/month × 2 months = $2.84M

Power & Cooling:

  • Power: $220K
  • Cooling: $84K
  • Total: $304K

Networking:

  • InfiniBand: $95K (amortized)
  • Storage network: $15K (amortized)
  • Total: $110K

Storage:

  • Distributed training data storage: $30K (amortized)
  • Checkpoint storage: $2K
  • Total: $32K

Data pipeline:

  • Data loading infrastructure: $40K (amortized)

Monitoring & Tools:

  • Prometheus, Grafana, custom tools: $10K

Personnel (2 months):

  • 15 ML researchers: $15K/month × 2 × 15 = $450K
  • 5 infrastructure engineers: $12K/month × 2 × 5 = $120K
  • Total: $570K

Overhead & Misc:

  • Failed runs, experimentation: $300K
  • Facilities, security, admin: $100K
  • Total: $400K

Grand Total: $2.84M + $307K + $110K + $32K + $40K + $10K + $570K + $400K ≈ $4.31M

This aligns very closely with David’s estimate and DeepSeek’s reported $5.6M (difference likely in data costs, failed experiments, and contingency).

Lessons for Other Organizations

If you’re planning to train your own frontier model, here are key takeaways:

1. Own Your GPUs, Don’t Rent

Cloud GPU costs ($3-4/hour) are 2-3x owned GPU costs when amortized:

  • 2,788,000 hours × $3.50/hour = $9.7M (cloud)
  • vs $2.84M (owned, amortized)
  • Savings: $6.9M (3.4x cheaper to own)

Breakeven: If you’ll use GPUs >40% of the time over 3 years, buying is cheaper.
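A hedged version of that breakeven, on hardware cost alone (power, cooling, and operations staff are excluded here, which is roughly what pushes the all-in breakeven toward the 40% quoted above):

```python
def breakeven_utilization(gpu_price=25_000, cloud_rate=3.50, years=3):
    """Utilization above which owning beats renting, hardware cost only."""
    owned_per_lifetime_hour = gpu_price / (years * 365 * 24)
    return owned_per_lifetime_hour / cloud_rate

print(f"{breakeven_utilization():.0%}")  # ~27% on hardware alone
```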

2. Optimize for Your Workload

DeepSeek optimized every aspect for MoE + FP8 + H800:

  • Expert placement minimizes communication
  • Batch sizes tuned for interconnect bandwidth
  • Pipeline parallelism hides latency

Don’t use generic training scripts – invest in workload-specific optimization.

3. Data Pipeline Is Critical

10-20% of DeepSeek’s efficiency likely comes from data pipeline optimization:

  • Parallel loading
  • Smart caching
  • Prefetching

Budget $500K-1M for data infrastructure, not just GPUs.

4. Build for Failure

2,048 GPUs will fail. Plan for it:

  • 5-10% spare capacity
  • Aggressive checkpointing
  • Automated recovery
  • Don’t let a $20K GPU failure waste $100K of training time

5. Power and Cooling Matter

$300K for power and cooling is 5-7% of total training cost:

  • Negotiate industrial power rates
  • Optimize cooling (liquid if high volume)
  • Consider datacenter location (cheaper power regions)

Conclusion

DeepSeek V3.2’s $5.6M training cost is achievable but required:

  • Owning GPUs (not cloud rental)
  • Architectural optimization for H800 limitations
  • 85-95% GPU utilization through careful engineering
  • Efficient data pipeline
  • Expert placement minimizing communication
  • Aggressive batching and parallelism strategies

The cost isn’t just hardware – it’s expert infrastructure engineering. But it’s replicable. Other organizations with:

  • $50-60M for GPU cluster capex
  • Strong infrastructure engineering team
  • 6-12 months for setup and optimization

can achieve similar costs for future training runs.

The democratization David described is real, but it’s not quite “download the code and train for $5.6M”. It’s “invest $50M in infrastructure, hire experts, spend 6 months optimizing, then train models for $5-10M each”.

Still, that’s 10x better than 2024’s reality. Progress.


Rachel Martinez, Infrastructure Engineering Lead, formerly managed GPU clusters for large-scale model training

Let me start with brutal honesty about what “accessible frontier AI training” means for independent labs like mine:

My lab’s situation:

  • Team: 3 full-time researchers (including me), 2 part-time
  • Funding: $850K/year (mix of grants, consulting, one angel investor)
  • Previous projects: Fine-tuned Llama models, trained smaller models from scratch (~7B parameters)
  • Infrastructure: Access to university cluster (limited), some cloud credits

When I first saw DeepSeek’s $5.6M number, my immediate thought was: “Still 6-7x our entire annual budget.” But then I started thinking about what this actually enables.

The Math for Independent Labs

Let’s work through different scenarios:

Scenario 1: Cloud-Based Training (Most Accessible)

Option A: Full Cloud Training

  • 2,788,000 H100 hours on AWS/GCP/Azure
  • Cost: ~$8-10/hour per H100
  • Total: $22-28 million
  • Verdict: Completely impossible for independent labs

Option B: Spot Instances

  • Same compute on spot instances (70% discount when available)
  • Cost: ~$2.50/hour per H100
  • Total: ~$7 million
  • Verdict: Still impossible, but getting closer

Option C: Preemptible Training with Checkpointing

  • Use spot/preemptible instances aggressively
  • Accept interruptions, resume from checkpoints
  • 80% spot availability over 3-4 months
  • Effective cost: ~$5-6 million
  • Verdict: Technically feasible, financially still out of reach
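
Option C's economics can be modeled roughly as follows. The $2.50/hr spot rate and 80% availability are the figures above; the 5% restart overhead is an assumption. Note that at $2.50/hr this simple model lands nearer $7M than $5-6M, so the lower estimate presumably assumes deeper preemptible discounts during off-peak windows:

```python
# Back-of-envelope model for spot training with interruptions. Availability
# stretches wall-clock time; preemptions waste some GPU-hours on
# checkpoint/restart (the 5% overhead figure is an assumption).

GPU_HOURS = 2_788_000    # total compute needed
SPOT_RATE = 2.50         # $/H100-hour on spot
AVAILABILITY = 0.80      # fraction of time spot capacity is obtainable
RESTART_OVERHEAD = 0.05  # assumed extra GPU-hours lost to preemptions

def spot_cost() -> float:
    billed_hours = GPU_HOURS * (1 + RESTART_OVERHEAD)
    return billed_hours * SPOT_RATE

def wall_clock_months(cluster_size: int = 2_000) -> float:
    effective_gpus = cluster_size * AVAILABILITY
    hours = GPU_HOURS * (1 + RESTART_OVERHEAD) / effective_gpus
    return hours / (24 * 30)

print(f"Effective cost: ${spot_cost() / 1e6:.1f}M")
print(f"Wall-clock time: {wall_clock_months():.1f} months (plus downtime)")
```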

Scenario 2: University Partnership

Many independent labs (including mine) have university affiliations:

Cost Structure:

  • University owns GPUs (compute cluster)
  • Grants cover power, storage, personnel
  • No capital expenditure required

Challenge: Getting 2,000+ GPU allocation

  • Most university clusters: 100-500 GPUs total
  • Heavy competition for resources
  • Allocation limits: typically 10-50 GPUs max per project

Possible approach:

  • Multi-university collaboration (5-10 universities pooling resources)
  • Each contributes 200-400 GPUs
  • Coordinate training across distributed sites

Cost:

  • Power and cooling: $300-400K (universities often subsidize)
  • Personnel: $500-600K (mix of professors, postdocs, students)
  • Coordination overhead: $100-200K
  • Total: $900K - $1.2M

Verdict: Financially feasible, logistically challenging

Scenario 3: Crowdfunding / Community Model

The crypto/web3 community has shown that crowdfunding can raise millions for ambitious tech projects:

Funding Strategy:

  • Crowdfunding campaign: Target $3M
  • Angel investors / VCs: $2M
  • Grants (NSF, OpenAI, etc.): $1M
  • Total budget: $6M

Use of funds:

  • GPU rental (spot instances, 3-4 months): $4M
  • Personnel (6 months for team of 10): $800K
  • Infrastructure, data, tools: $500K
  • Contingency: $700K

ROI for funders:

  • Open-source model release (MIT license)
  • Research papers (academic credit)
  • Commercial fine-tuning services (revenue share)
  • Community recognition and influence

Verdict: Plausible for charismatic researchers with strong community ties

What Makes This Different From GPT-3 Era

When GPT-3 came out (2020), independent research on that scale was unthinkable:

GPT-3 training (estimated):

  • Cost: $4-5 million (actually comparable to DeepSeek!)
  • But: Required V100 cluster that only major labs had
  • Infrastructure: Not available to independents at any price
  • Expertise: Cutting-edge, closely guarded

DeepSeek V3.2 (2025):

  • Cost: $5.6 million (similar)
  • But: Can be done on cloud instances (available to anyone with money)
  • Infrastructure: Public cloud providers offer necessary hardware
  • Expertise: Architecture details published, code open-sourced

The key difference isn’t cost – it’s accessibility of resources and knowledge.

The “Indie AI Lab” Playbook

If I were to seriously attempt a DeepSeek-scale training run, here’s how I’d approach it:

Phase 1: Fundraising (6 months)

Target: $5-7M total

Sources:

  1. Kickstarter / Gitcoin ($500K-1M)

    • Pitch: “Community-owned frontier AI model”
    • Rewards: API credits, early access, contributor recognition
  2. Angel Investors ($1-2M)

    • Pitch: “10x cheaper frontier models enable new business models”
    • Return: Equity in commercial services built on model
  3. Grants ($500K-1M)

    • NSF, NIH for research applications
    • OpenAI Researcher Access Program
    • European research funding (Horizon Europe)
  4. Corporate Sponsors ($1-2M)

    • Companies that would benefit from open model
    • Data partnerships (contribute training data)
    • Compute partnerships (cloud credits)
  5. DAO / Crypto Funding ($1-2M)

    • Pitch to decentralized science (DeSci) DAOs
    • Token-based funding model
    • Community governance of model

Feasibility: Hard but possible. Similar amounts have been raised for ambitious open-source projects.

Phase 2: Team Building (3 months, overlaps with fundraising)

Core team (10 people):

  • 1 ML lead (world-class MoE expertise): $200K/year
  • 3 ML researchers (training, architecture): $150K/year each
  • 2 Infrastructure engineers (distributed training): $160K/year each
  • 1 Data engineer (pipeline, cleaning): $140K/year
  • 1 Project manager: $120K/year
  • 2 Research engineers (evaluation, analysis): $130K/year each

Total personnel cost (6 months): ~$745K (half of the ~$1.49M annual run rate above)

Recruiting strategy:

  • Mix of senior hires (1-2) and junior talent (6-7)
  • Remote-first (access global talent, lower costs)
  • Equity/token grants to reduce cash burn
  • Mission-driven (attract people who care about open AI)

Phase 3: Infrastructure Setup (2 months)

Cloud strategy:

  • Primary: AWS spot instances (cheapest H100 access)
  • Backup: GCP for redundancy
  • Strategy: Run on spot instances, checkpoint aggressively, resume on interruption

Setup:

  • Distributed training framework (Megatron-DeepSpeed + custom MoE code)
  • Monitoring (Prometheus, Grafana, custom dashboards)
  • Data pipeline (based on DeepSeek’s open-source code)
  • Checkpoint management (S3 + automation)

Cost: $100-200K for setup, tools, testing

Phase 4: Data Acquisition (4 months, overlaps with setup)

This is underrated – quality training data is hard:

Data sources:

  • Common Crawl: 50% (free but needs heavy filtering)
  • GitHub: 10% (code, free)
  • Academic papers: 10% (ArXiv, open access)
  • Books: 10% (public domain, licensed)
  • Curated datasets: 20% (partnerships, licensing)

Data processing:

  • Deduplication, filtering, quality scoring
  • Tokenization, packing
  • 100+ TB raw → 10TB processed

Cost: $200-400K (mostly licensing and compute for processing)
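
As a toy illustration of the processing steps above: exact deduplication by content hash plus a crude quality heuristic. Production pipelines use fuzzy dedup (MinHash/LSH) and learned quality scorers; the thresholds here are invented:

```python
# Toy dedup-and-filter pass: drop exact duplicates via SHA-256 hashing,
# then apply simple quality heuristics (length, digit density).

import hashlib

def dedup_and_filter(docs, min_words: int = 5, max_digit_frac: float = 0.3):
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue                      # exact duplicate
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue                      # too short to be useful
        digits = sum(c.isdigit() for c in doc)
        if digits / max(len(doc), 1) > max_digit_frac:
            continue                      # likely tables/boilerplate
        kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate
    "too short",                                     # filtered: length
    "123 456 789 000 111 222 333 444",               # filtered: digit-heavy
]
print(len(dedup_and_filter(corpus)))  # 1
```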

Phase 5: Training (3-4 months)

Compute allocation:

  • 2,000 H100 spot instances (when available)
  • Fall back to 1,500-1,800 if spots unavailable
  • Accept longer training time (4 months vs 2 months)

Cost: $4-5M (depends on spot instance availability)

Approach:

  • Start with smaller model (300B) to validate pipeline
  • Scale to full 671B once confident
  • Aggressive experimentation early, conservative later

Phase 6: Evaluation & Release (1 month)

Testing:

  • Standard benchmarks (MMLU, HumanEval, MATH, etc.)
  • Custom evaluation on target applications
  • Red teaming for safety issues

Release:

  • Model weights: MIT License on Hugging Face
  • Training code: GitHub
  • Documentation: Comprehensive guides
  • Paper: ArXiv, submit to NeurIPS/ICML

Cost: $100K (compute for evaluation, documentation)

Total Budget: $5.5-7M over 12 months

This is… actually feasible. Difficult, risky, but not impossible.

The Real Barriers

The budget is daunting but not the biggest challenge. The real barriers are:

1. Coordination and Execution Risk

Training a 671B parameter model is hard:

  • Requires deep expertise (few people have done it)
  • High technical risk (many failure modes)
  • Tight coordination needed (distributed team of 10)

Mitigation:

  • Hire 1-2 people who’ve done large-scale training before
  • Start with smaller model to learn
  • Budget for failures (hence the $7M, not $5.6M)

2. Fundraising Difficulty

Raising $5-7M for a research project is no joke:

  • Requires proven track record
  • Needs compelling story
  • Competitive with many other projects seeking funding

Mitigation:

  • Focus on clear value proposition (open alternative to GPT-4)
  • Partner with established organizations (universities, nonprofits)
  • Phased funding (raise $2M, achieve milestone, raise more)

3. Talent Competition

The 10 people needed are in high demand:

  • Can earn 2-3x at major tech companies
  • Competing with OpenAI, Google, Meta for talent
  • Independent labs are risky (funding could dry up)

Mitigation:

  • Mission-driven appeal (work on open AI for public good)
  • Equity/ownership (share in success)
  • Flexibility (remote, interesting problems, autonomy)
  • Career benefit (lead role on major project, publishable research)

Why This Matters for Independent Research

The $5.6M cost threshold is psychologically important:

$50M: Only major corporations or heavily-funded startups
$5M: Ambitious, but achievable by coordinated independent effort

This is similar to other democratization moments:

Genome sequencing: $100M (2001) → $1M (2010) → $1K (2020)

  • $100M: Only government-funded projects
  • $1M: Well-funded academic labs could do it
  • $1K: Any researcher can sequence

Satellite launches: $100M+ (1990s) → single-digit millions (2020s with Rocket Lab and rideshares)

  • $100M+: Only governments and major corporations
  • Single-digit millions: Universities, startups, research projects

We’re now at the “$1M genome sequencing” era for frontier AI. Still expensive, but within reach of coordinated independent efforts.

What I’m Actually Doing

Full disclosure: I’m not immediately training a $5.6M model. But here’s my actual plan:

Near-term (2026): Training Smaller, Specialized Models

Budget: $200-300K (within my current funding)

  • 30B parameter MoE model (8 experts, 8B active)
  • Specialized for scientific research (my domain)
  • Train on 1-2 trillion tokens of scientific data
  • 4-6 week training on cloud spot instances

Goal: Learn the MoE training pipeline, prove capability, build reputation

Medium-term (2027): Scale to ~100-200B Parameters

Budget: $800K - $1.5M (need fundraising)

  • Similar architecture to DeepSeek, smaller scale
  • Domain-specialized (science + code)
  • Partner with universities for compute
  • Target: Matches Llama 3.1 70B quality at 1/3 inference cost

Goal: Demonstrate frontier-adjacent capability, attract more funding

Long-term (2028+): Frontier Model

Budget: $5-8M (serious fundraising required)

  • Full 671B parameter model or similar
  • General purpose or specialized vertical
  • Multi-institution collaboration
  • Goal: True GPT-4 alternative, open-source

Path: Prove myself with smaller models first, build credibility, attract funding

Advice for Other Independent Researchers

If you’re considering ambitious model training:

Start Small

  • Don’t jump straight to $5M training
  • Train 7B → 30B → 100B → 671B
  • Learn the pipeline at smaller scale first
  • Build credibility for fundraising

Find Partners

  • Universities (compute resources)
  • Companies (data, use cases, funding)
  • Other researchers (split costs and work)
  • Foundations (grant funding)

Pick Your Battles

  • Don’t train general-purpose GPT-4 clone (OpenAI already did it)
  • Find underserved niches (science, medicine, low-resource languages)
  • Differentiate on data or architecture, not just scale

Build in Public

  • Share progress, learnings, code
  • Build community and support
  • Makes fundraising easier (proof of capability)
  • Attracts collaborators

Conclusion

DeepSeek’s $5.6M training cost doesn’t make frontier AI training easy for independent researchers. But it makes it conceivable.

The barriers are:

  • Still expensive ($5-7M is a lot)
  • Requires serious fundraising
  • Needs technical expertise
  • High execution risk

But for the first time, these barriers aren’t insurmountable. A coordinated effort by motivated independent researchers, with the right funding and partnerships, could realistically train a frontier model.

This is huge. It means:

  • Academic research can return to frontier AI
  • Independent perspectives can challenge big tech
  • Innovation isn’t limited to those with $50M+ budgets
  • Open-source can keep pace with proprietary models

I’m not sure when my lab will train a $5M model. But knowing it’s possible – that’s motivating. That’s the future DeepSeek just opened up.

Maybe in 3-5 years, we’ll see dozens of independent labs training frontier models. That would be a very different AI landscape than today’s duopoly.

Here’s to the indie AI researchers. Our time might be coming.


Tom Anderson, Founder of Independent AI Research Lab, formerly ML researcher at academic institution

Before DeepSeek V3.2, the AI startup investment landscape looked like this:

Tier 1: Foundation Model Companies

Examples: OpenAI, Anthropic, Inflection, Cohere

  • Capital requirement: $200M+ to reach frontier performance
  • Funding rounds: Mega-rounds ($1B+) from specialized investors
  • Investor profile: Only large funds, strategic corporates (Microsoft, Google)
  • Our fund: Too small to participate meaningfully

Tier 2: Application Layer Companies

Examples: Jasper, Copy.ai, Harvey, Glean

  • Capital requirement: $20-100M to build product and acquire customers
  • Funding rounds: Series A-C ($10-50M)
  • Investor profile: Traditional VCs (this is us)
  • Value capture: Limited by API costs from Tier 1 companies

Tier 3: Tools and Infrastructure

Examples: LangChain, Weights & Biases, Modal

  • Capital requirement: $10-50M
  • Investor profile: Traditional VCs
  • Risk: Commoditization as OpenAI/Anthropic build integrated platforms

The problem: Tier 1 captured most value but required capital we didn’t have. Tier 2 and 3 were accessible but faced margin compression from Tier 1’s pricing power.

Post-DeepSeek Investment Landscape

Now consider the new world:

New Tier: Efficient Foundation Model Companies

Enabled by: $5-10M training costs, not $50-100M

  • Capital requirement: $25-50M total (Series A/B scale!)
  • Funding rounds: Standard VC rounds, not mega-rounds
  • Investor profile: Traditional VCs can participate
  • Opportunity: Challenge OpenAI/Anthropic with cost-efficient models

This creates a new investable category for VCs like me.

Investment Thesis: The Cost-Efficient Foundation Model

Let me sketch out a hypothetical investment in a DeepSeek-inspired AI startup:

Company Profile: “Efficient AI Labs” (fictional example)

Mission: Train frontier models at 10x lower cost than OpenAI

Team:

  • CEO: Former OpenAI/Google researcher with training experience
  • CTO: Infrastructure expert (trained Llama at Meta)
  • 5-10 ML researchers and engineers

Product Strategy:

  1. Train DeepSeek-style efficient foundation model ($5-8M)
  2. Offer API access at 50% of OpenAI’s pricing
  3. Self-serve platform for fine-tuning and deployment
  4. Target cost-sensitive enterprises and developers

Funding Ask: $40M Series A

Use of Funds

Model Training ($10M):

  • Initial model: $6M
  • Second iteration (based on learnings): $4M

Infrastructure ($8M):

  • GPU cluster (owned): $5M (~160 H100s at ~$30K each)
  • Serving infrastructure: $2M
  • Data pipeline: $1M

Personnel ($15M over 24 months):

  • 20 people × $150K average × 2 years = $6M
  • Recruiting, benefits, overhead: $4M
  • Data labeling, contractors: $5M

Sales & Marketing ($5M):

  • Developer relations
  • Enterprise sales team
  • Marketing campaigns

Contingency & Operations ($2M)

Revenue Model

API Pricing:

  • 50% of OpenAI pricing (e.g., $5 per 1M tokens vs $10)
  • Target: Cost-sensitive high-volume customers
  • Gross margin: 70% (lower inference costs due to efficient architecture)

Revenue Projections (conservative):

Year 1:

  • 100 paying customers
  • Average $2K/month
  • ARR: $2.4M

Year 2:

  • 500 paying customers
  • Average $5K/month (growing usage)
  • ARR: $30M

Year 3:

  • 2,000 customers
  • Average $8K/month
  • ARR: $192M
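
The projection above is straightforward ARR arithmetic (customers × monthly spend × 12), using the article's own assumptions:

```python
# The three-year revenue projection as arithmetic: ARR = customers ×
# average monthly spend × 12. All inputs are the article's assumptions.

projections = [
    (1, 100,   2_000),   # year, paying customers, avg $/month
    (2, 500,   5_000),
    (3, 2_000, 8_000),
]

for year, customers, monthly in projections:
    arr = customers * monthly * 12
    print(f"Year {year}: ARR ${arr / 1e6:.1f}M")
# Year 1: ARR $2.4M
# Year 2: ARR $30.0M
# Year 3: ARR $192.0M
```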

Why This Works Now (But Didn’t Before)

Previous economics:

  • Training cost: $50M
  • Need to raise: $100M+ (includes scaling, sales, etc.)
  • Requires mega-round from top-tier VCs
  • Most VC funds can’t write $20-50M checks

New economics:

  • Training cost: $5-10M
  • Need to raise: $40M total
  • Series A scale ($20M) + Series B ($20M)
  • Accessible to traditional VC funds

Our fund size: $250M

  • Can lead $20M Series A (8% of fund)
  • Reasonable position sizing
  • Can participate in follow-on rounds

Disruption of Incumbent Business Models

From an investor perspective, here’s what’s at risk:

OpenAI (Microsoft Investment: $13B)

Current moat:

  • Best models (GPT-4, GPT-4o)
  • First-mover advantage
  • Ecosystem lock-in
  • Brand recognition

Threatened by DeepSeek-style competitors:

  • Model quality gap narrows (DeepSeek matches GPT-4 on many benchmarks)
  • Pricing pressure (competitors at 50% of OpenAI pricing)
  • Customer churn to cheaper alternatives
  • Enterprise customers demand self-hosting options

Likely OpenAI response:

  • Price cuts (compress margins)
  • Faster model iteration (increase R&D spend)
  • Enterprise features (security, customization)
  • Vertical integration (build applications)

Investment implication: OpenAI’s valuation ($80-100B) assumes sustained pricing power. If competition intensifies, multiple could compress 30-50%.

Anthropic (Funding: $7B+)

Current moat:

  • Claude 3.5 Sonnet quality (arguably best overall model)
  • Safety focus (appeals to enterprises)
  • Constitutional AI differentiator

Threatened by:

  • Open-source models matching quality
  • Can’t compete on price with cost-efficient competitors
  • Safety advantages may commoditize as others adopt similar techniques

Likely response:

  • Double down on safety and reliability (enterprise premium)
  • Specialized models (healthcare, legal, finance)
  • Partner with governments and large enterprises

Investment implication: Anthropic’s path to profitability harder if forced to cut prices 30-50%.

Application Layer Startups (Jasper, Copy.ai, etc.)

Current pain point:

  • API costs are 40-60% of revenue
  • Gross margins only 40-60%
  • Difficult to achieve profitability at scale

DeepSeek changes everything:

  • Can self-host DeepSeek V3.2 for fraction of API costs
  • Or use cheaper APIs from DeepSeek-style competitors
  • Gross margins improve to 70-85%
  • Path to profitability accelerates

Investment implication: Application layer companies become MORE attractive investments (better unit economics).

New Investment Opportunities

DeepSeek creates several new investment themes:

1. Domain-Specialized Efficient Models

Thesis: Train DeepSeek-style models specialized for specific verticals

Examples:

  • Medical AI: Train on medical literature, clinical trials, patient records
  • Legal AI: Train on case law, contracts, regulations
  • Financial AI: Train on financial reports, market data, transactions
  • Code AI: Enhanced coding models (10x GitHub data vs general models)

Why it works:

  • $5-10M training cost makes vertical models economically viable
  • Specialized models outperform general models in specific domains
  • Enterprises pay premium for domain expertise
  • Regulatory compliance easier with domain-specific models (no irrelevant capabilities)

Investment size: $20-40M to build specialized model company

2. Infrastructure for Efficient Training

Thesis: Tools and platforms to help others train DeepSeek-style models

Products:

  • MoE training frameworks (optimized for cost efficiency)
  • Data pipelines for trillion-token scale
  • Hyperparameter optimization for sparse models
  • Managed training platforms (bring your data, we train your model)

Examples (current landscape):

  • Together AI, Mosaic ML (acquired), Modal
  • Gap: No one focused specifically on cost-efficient training

Investment size: $15-30M

3. Model Evaluation and Testing

Thesis: As model proliferation accelerates, need better evaluation tools

Products:

  • Benchmark-as-a-service (evaluate on proprietary benchmarks)
  • Red-teaming services (find model weaknesses)
  • Performance comparison platforms
  • Domain-specific evaluation (medical accuracy, legal compliance)

Why now: With dozens of models (OpenAI, Anthropic, DeepSeek, Llama, etc.), choosing the right model gets harder

Investment size: $10-25M

4. Open Source Hosting and Fine-Tuning

Thesis: Demand for self-hosted DeepSeek alternatives will explode

Products:

  • Managed deployment of DeepSeek V3.2 (like Replicate, Hugging Face)
  • Fine-tuning platforms specialized for MoE models
  • Model optimization (quantization, distillation for deployment)

Addressable market: Every company currently using OpenAI API could be customer

Investment size: $20-40M

Risks to This Investment Thesis

I always consider downside scenarios:

Risk 1: OpenAI/Anthropic Price War

If OpenAI cuts prices 50-70% to maintain market share:

  • Startup competitors face margin compression
  • API cost advantage disappears
  • Self-hosting value proposition weakens

Likelihood: Moderate (30-40%)

  • OpenAI has shown willingness to cut prices (GPT-3.5 is cheap)
  • But they also need to justify $80B+ valuation to investors
  • Price war hurts everyone’s economics

Mitigation: Invest in companies with defensibility beyond price (quality, specialization, enterprise features)

Risk 2: DeepSeek Quality Doesn’t Hold Up

If real-world testing reveals DeepSeek is actually worse than GPT-4:

  • Benchmark scores were misleading (data contamination?)
  • Quality gaps emerge in production use
  • Customers choose GPT-4 despite higher cost

Likelihood: Low-Moderate (20-30%)

  • Some skepticism is warranted (Emily’s points)
  • But architecture is sound, many confirmations emerging
  • Even if 90% of GPT-4 quality, still valuable at 50% of cost

Mitigation: Due diligence includes extensive model testing before investment

Risk 3: Training Costs Rise Again

If next-generation models require $20-50M to train:

  • DeepSeek’s $5.6M was one-time efficiency gain
  • Keeping up with state-of-the-art requires more capital
  • Back to expensive mega-rounds

Likelihood: Moderate (30%)

  • AI capabilities may require more scale
  • Multimodal models (vision, audio) need more data/compute
  • But: Efficiency innovations continue (INT4, better architectures)

Mitigation: Invest in companies with ongoing efficiency R&D, not just training one model

Risk 4: Regulatory Intervention

Governments may regulate AI model training:

  • Licensing requirements for frontier models
  • Restrictions on open-source releases
  • Data privacy regulations increase training costs

Likelihood: Moderate-High (40-50%)

  • EU AI Act already in effect
  • US Congress considering legislation
  • China has model registration requirements

Mitigation: Invest in companies with strong regulatory/legal teams, enterprise focus

Portfolio Construction Strategy

Given these opportunities and risks, here’s how I’m adjusting our portfolio:

Reduce Exposure to:

  • Pure application layer companies without model differentiation
  • Expensive foundation model companies requiring $100M+ rounds
  • OpenAI/Anthropic “wrapper” companies with no defensibility

Increase Exposure to:

  • Domain-specialized model companies ($20-40M raises)
  • Infrastructure for efficient training and deployment
  • Enterprise AI with self-hosting options
  • Open-source focused companies (align with DeepSeek trend)

Target Portfolio Allocation (AI investments):

  • 30%: Application layer with strong unit economics (using DeepSeek-style models)
  • 30%: Domain-specialized efficient foundation models
  • 20%: Infrastructure and tools
  • 15%: Enterprise AI with self-hosting
  • 5%: Evaluation, safety, governance

What I’m Looking For in Founders

If you’re building an AI company and seeking investment, here’s what gets my attention:

Must-Haves:

  1. Deep technical expertise: Led large-scale model training before (OpenAI, Google, Meta, DeepSeek experience)
  2. Cost-consciousness: Architecture and business model optimized for efficiency
  3. Clear differentiation: Not just “GPT-4 but cheaper” – have a specific angle
  4. Enterprise focus: B2B revenue model, not consumer (better margins, retention)

Nice-to-Haves:

  1. Open source experience: Understanding of community-driven development
  2. Domain expertise: If building vertical AI, deep domain knowledge
  3. Previous successful exit: De-risks execution ability
  4. Strong network: Can recruit top talent from major AI labs

Conclusion: A New Golden Age for AI Venture Capital

The $5.6M training cost is the most important development for AI venture capital since the GPT-3 API launch.

It transforms foundation models from mega-fund territory to traditional VC territory. This means:

  • More competition (dozens of well-funded model companies)
  • More innovation (can afford experimentation)
  • Better outcomes for application layer (lower costs, better margins)
  • Healthier ecosystem (less concentration of power)

For investors, this is exciting. The AI investment landscape was becoming bifurcated: mega-funds invest in foundation models, everyone else fights for application layer scraps. Now there’s a vibrant middle ground.

I expect 2026-2027 to see a wave of well-funded ($30-50M) efficient foundation model companies, all inspired by DeepSeek’s architecture. Some will fail, but a few will succeed and build substantial businesses.

The AI market is big enough for multiple winners. OpenAI and Anthropic won’t disappear, but they’ll face real competition from cost-efficient alternatives. That’s healthy.

For VCs willing to develop deep technical diligence capabilities (understanding MoE architectures, training efficiency, benchmark methodology), this is a generational investment opportunity.

I’m actively seeking AI companies to invest in. If you’re building an efficient foundation model or enabling infrastructure, reach out.

The DeepSeek era of AI venture capital is just beginning.


Lisa Zhang, Partner at mid-sized VC fund ($250M AUM), focused on AI/ML investments