$5.6M vs $100M: How DeepSeek V3.2 Achieves 95% Cost Reduction in AI Training

Let me break down why this matters, what it means for the industry, and how DeepSeek achieved this breakthrough.

The Training Cost Landscape

First, let’s establish the baseline. Training costs for frontier AI models have been escalating dramatically:

Historical Training Costs (Estimated)

  • GPT-3 (175B, 2020): ~$4-5 million
  • GPT-4 (rumored ~1.8T MoE, 2023): $50-100 million
  • Claude 3 Opus (2024): $50-80 million (estimated)
  • Gemini Ultra (2024): $80-120 million (estimated)
  • Llama 3.1 405B (2024): ~$8-12 million

The trend was clear: to reach GPT-4-class performance, you needed to spend $50M+. This created a massive barrier to entry and concentrated AI development in a handful of well-funded organizations.

DeepSeek V3.2 breaks this trend entirely: $5.6 million for GPT-4-competitive performance is one-tenth to one-twentieth the expected cost.

Breaking Down the $5.6M Cost

Let’s analyze what went into this number:

GPU Compute Costs

DeepSeek reports 2.788 million H800 GPU hours for training. Let’s calculate:

Hardware: H800 GPUs (export-restricted variant of H100)

  • Cloud cost: ~$3-4/hour per H800 GPU (China cloud providers)
  • Total compute cost: 2,788,000 hours × $3.50/hour = $9.76 million

Wait – that’s already higher than the reported $5.6M total cost. What’s going on?

The key is that DeepSeek likely owns their GPUs rather than renting cloud compute:

Capital Expenditure Model:

  • H800 GPU purchase price: ~$25,000 per GPU (vs H100 at $30-40K)
  • For 2.788M GPU hours on a 2,048 GPU cluster over ~2 months:
    • 2,788,000 hours ÷ 2,048 GPUs = 1,361 hours per GPU
    • ~57 days of continuous training

Depreciation Cost:

  • 2,048 × $25,000 = $51.2M capital investment
  • Assuming 3-year depreciation: $51.2M ÷ 36 months = $1.42M/month
  • Training cost: $1.42M × 2 months = $2.84M in GPU depreciation

Operating Costs:

  • Power: ~700W per H800 × 2,048 GPUs = 1.43 MW
  • Over 57 days: 1,960 MWh
  • At China industrial rates (~$0.08/kWh): $157K
  • Cooling (typically 0.5x power): $79K
  • Total power + cooling: $236K

Networking and Storage:

  • High-speed interconnect (InfiniBand): ~$200K for 2K GPU cluster
  • Distributed storage: ~$100K
  • Total: $300K

Personnel:

  • 20 researchers/engineers × 2 months × $15K/month = $600K
  • (Note: Chinese AI researcher salaries are 30-50% of US equivalents)

Total Estimated Cost:

  • GPU depreciation: $2.84M
  • Power + cooling: $236K
  • Networking + storage: $300K
  • Personnel: $600K
  • Misc (data, tools, overhead): $200K
  • Grand total: ~$4.2M

This is in the ballpark of the reported $5.6M, with the difference likely in:

  1. Data acquisition and processing costs
  2. Failed training runs (not every experiment succeeds)
  3. Infrastructure setup and tooling
  4. Conservative vs aggressive depreciation assumptions
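The estimate above is easy to sanity-check in code. The sketch below just reruns this section's arithmetic; every input (GPU price, depreciation schedule, power rate, and so on) is an assumption from the text, not a disclosed figure:

```python
# Back-of-envelope cost model for the DeepSeek-style training run.
# All inputs are this section's estimates, not reported numbers.

def training_cost_estimate(
    num_gpus=2048,
    gpu_price=25_000,        # assumed H800 purchase price
    depreciation_months=36,  # straight-line over 3 years
    training_months=2,
    it_power_kw=1_434,       # 700 W x 2,048 GPUs
    days=57,
    power_rate=0.08,         # $/kWh, assumed China industrial rate
    cooling_factor=0.5,      # cooling draw as a fraction of IT power
    network_storage=300_000,
    personnel=600_000,
    misc=200_000,
):
    depreciation = num_gpus * gpu_price / depreciation_months * training_months
    power_cost = it_power_kw * 24 * days * power_rate
    cooling_cost = power_cost * cooling_factor
    return depreciation + power_cost + cooling_cost + network_storage + personnel + misc

print(f"${training_cost_estimate() / 1e6:.1f}M")  # ~$4.2M, matching the total above
```

Changing any single assumption (say, a 4-year depreciation schedule) moves the total by hundreds of thousands of dollars, which is why the reported $5.6M and this $4.2M estimate can both be "right".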

How DeepSeek Achieved This Cost Efficiency

The $5.6M cost isn’t magic – it’s the result of several architectural and engineering decisions:

1. Mixture-of-Experts Architecture (Biggest Cost Saver)

With 671B total parameters but only 37B active per token, DeepSeek dramatically reduces the computational cost per training step.

Cost Impact Analysis:

A hypothetical 671B dense model would require:

  • Forward pass: 671B parameter computations per token
  • Backward pass: ~2× forward pass cost
  • Total: ~2 trillion FLOPs per token (rough estimate)

DeepSeek’s MoE with 37B active parameters:

  • Forward pass: 37B parameter computations per token (activated experts only)
  • Routing overhead: ~1-2B parameter computations (selecting experts)
  • Backward pass: More complex, but still proportional to active parameters
  • Total: ~120-150 billion FLOPs per token

Efficiency gain: ~15x reduction in FLOPs per training token

This means DeepSeek could train on 15x more tokens for the same compute budget, or achieve the same effective training for 1/15th the cost.
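To make the comparison concrete, here is the same FLOPs accounting as a tiny script. The `overhead` term (routing plus shared computation) is an assumption chosen to land in the 120-150B range estimated above:

```python
# Rough FLOPs-per-token comparison, using the same counting convention
# as the text: forward cost ~ active parameters, backward ~ 2x forward.

def train_flops_per_token(active_params, overhead=0.0):
    forward = active_params + overhead
    return forward + 2 * forward   # forward pass + backward pass

dense = train_flops_per_token(671e9)              # ~2.0e12, as in the text
moe = train_flops_per_token(37e9, overhead=8e9)   # ~1.35e11, assumed overhead
print(f"~{dense / moe:.0f}x fewer FLOPs per token")  # ~15x
```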

2. FP8 Mixed Precision Training

Training in FP8 instead of BF16/FP16 provides multiple cost savings:

Memory Bandwidth Savings:

  • BF16: 16 bits per parameter = 2 bytes
  • FP8: 8 bits per parameter = 1 byte
  • 50% reduction in memory traffic

This matters enormously because GPU training is often memory-bandwidth bound, not compute-bound. Halving memory bandwidth requirements effectively doubles training throughput on the same hardware.

Compute Throughput:

  • H800 GPUs have specialized FP8 tensor cores
  • FP8 operations: ~2000 TFLOPS (teraFLOPS)
  • BF16 operations: ~1000 TFLOPS
  • 2x compute throughput for FP8

Combined, FP8 training provides roughly 3-4x training efficiency compared to BF16.
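A minimal sketch of where the 3-4x figure comes from, using the assumed H800 peak rates quoted above. The two 2x gains apply to different bottlenecks (bandwidth vs compute), so the realized speedup sits below their product:

```python
# FP8 vs BF16: memory traffic halves and tensor-core throughput doubles.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}
PEAK_TFLOPS = {"bf16": 1000, "fp8": 2000}  # assumed H800 peak rates

bandwidth_gain = BYTES_PER_PARAM["bf16"] / BYTES_PER_PARAM["fp8"]  # 2.0
compute_gain = PEAK_TFLOPS["fp8"] / PEAK_TFLOPS["bf16"]            # 2.0

# Each training step is bound by one resource at a time, so the realized
# gain lands between 2x and the 2 x 2 = 4x upper bound.
print(bandwidth_gain * compute_gain)  # 4.0
```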

3. Efficient Data Pipeline

Training efficiency isn’t just about model architecture – data loading and preprocessing can be major bottlenecks.

DeepSeek likely optimized:

  • Data loading: Parallel data loading to keep GPUs saturated
  • Preprocessing: Move preprocessing to CPU to free up GPU cycles
  • Caching: Cache preprocessed data to avoid redundant computation
  • Compression: Compress data in transit to reduce network bottleneck

These optimizations can improve effective GPU utilization from 60-70% (typical) to 85-95% (excellent), a 20-40% efficiency gain.

4. H800 GPU Efficiency

While H800 is the export-restricted variant of H100 (lower interconnect bandwidth), for MoE training this matters less than you’d think:

  • MoE models: Different experts can run on different GPUs with less inter-GPU communication than dense models
  • Gradient accumulation: Can use larger local batch sizes to amortize communication cost
  • Smart sharding: Place experts on GPUs to minimize cross-GPU routing

DeepSeek’s architecture seems optimized for H800’s limitations, extracting near-H100 performance for their specific workload.

5. Training Recipe Optimization

DeepSeek’s training recipe likely includes:

  • Curriculum learning: Start with easier data, gradually increase difficulty (faster convergence)
  • Multi-Token Prediction: Get more training signal per forward pass
  • Optimal batch size: Carefully tuned batch size for best convergence vs throughput tradeoff
  • Learning rate schedule: Aggressive early learning rate for fast initial progress

These recipe optimizations can reduce total training steps by 30-50% compared to naive training.

Compounding Efficiency Gains

The key insight is that these efficiency factors multiply:

  • MoE architecture: 15x
  • FP8 precision: 3x
  • Data pipeline: 1.3x
  • H800 optimization: 1.2x
  • Training recipe: 1.5x

Total efficiency: 15 × 3 × 1.3 × 1.2 × 1.5 ≈ 100x

Of course, these aren’t all independent – some gains overlap. But even accounting for overlap, DeepSeek likely achieved 30-50x better cost efficiency than a naive approach.

This explains how they spent $5.6M to achieve what might have cost $100-200M with standard techniques.

Comparison to GPT-4 Training Economics

Let’s estimate GPT-4’s training costs (OpenAI hasn’t disclosed, but we can infer):

GPT-4 (Estimated):

  • Model size: ~1.8T parameters (MoE with ~300B active, rumored)
  • Training compute: ~10-15× GPT-3 (based on performance gap)
  • GPT-3 used ~3,640 petaflop/s-days
  • GPT-4: ~40,000-50,000 petaflop/s-days
  • At cloud GPU rates: $50-100M

Why did GPT-4 cost so much more?

  1. Larger scale: 1.8T total parameters vs 671B (2.7x more)
  2. More active parameters: ~300B active vs 37B (8x more per token)
  3. Higher precision: BF16 vs FP8 (2-3x more bandwidth/compute)
  4. Less optimized architecture: Early MoE design vs DeepSeek’s refined approach
  5. US costs: 3-4x higher cloud/power/personnel costs than China
  6. Broader R&D: GPT-4 training cost includes many failed experiments

Accounting for the quantified factors (1-5): 2.7 × 8 × 2.5 × 1.5 × 3.5 ≈ 284x cost ratio

The actual ratio is only ~18x ($100M vs $5.6M), which suggests:

  • DeepSeek’s efficiency gains are real but not nearly as dramatic as 284x
  • GPT-4’s cost might be on the lower end of estimates (~$30-40M)
  • DeepSeek’s cost might include only the successful training run, not R&D overhead

Still, even an ~18x cost advantage is revolutionary.

ROI Analysis for Different Organization Types

Let’s analyze what this cost structure means for different types of organizations:

Large Tech Companies (Google, Meta, Microsoft)

Before DeepSeek: Training cost $50-100M is manageable but significant

  • Requires executive approval, careful budgeting
  • Limits experimentation (can’t afford many failed runs)
  • 1-2 major model training runs per year

After DeepSeek: Training cost $5-10M is “rounding error” in R&D budget

  • Can approve multiple independent training runs
  • Enables rapid experimentation with architectures
  • Could train dozens of specialized models per year

Impact: Accelerates innovation, enables domain-specific model proliferation

Well-Funded AI Startups (Anthropic, Cohere, Inflection)

Before DeepSeek: $50-100M training cost requires dedicated fundraising

  • Need $100M+ Series B/C to afford frontier model training
  • Training budget is major capital allocation decision
  • High risk if model underperforms

After DeepSeek: $5-10M training cost fits within typical Series A budget

  • Can train frontier model on $25-50M total funding
  • Enables smaller AI startups to compete
  • Lower risk, can afford multiple attempts

Impact: Democratizes frontier AI development, increases competition

Academic Labs & Research Institutions

Before DeepSeek: $50-100M training cost is completely out of reach

  • Would consume entire annual budget of large CS departments
  • Frontier AI research limited to industry
  • Academia relegated to smaller models or fine-tuning

After DeepSeek: $5-10M training cost is attainable with major grants

  • NSF/NIH/etc. could fund frontier model training
  • Top universities could pool resources for shared models
  • Enables academic participation in frontier research

Impact: Returns frontier AI research to academia, faster fundamental progress

Independent Researchers & Small Labs

Before DeepSeek: Frontier model training completely impossible

  • Would need VC funding just to participate
  • Limited to analyzing others’ models
  • Innovation concentrated in large orgs

After DeepSeek: Still expensive, but conceivable with crowdfunding/patronage

  • Could raise $5M through crypto, crowdfunding, or angel investors
  • Enables “indie AI labs” to exist
  • One dedicated researcher could train frontier model

Impact: Enables long-tail innovation, unconventional approaches

Implications for AI Industry Economics

The $5.6M training cost has profound implications for AI business models:

1. Proprietary Model Moats Eroding

OpenAI and Anthropic’s competitive advantage has been:

  • Massive capital to train expensive models
  • Technical expertise to do it successfully
  • First-mover advantage on frontier capabilities

If training costs drop 10-20x, this moat shrinks dramatically:

  • Many more orgs can afford frontier training
  • Technical expertise will diffuse (via open-source releases like DeepSeek)
  • First-mover advantage compressed (others catch up faster)

Business model impact: API pricing power decreases, margins compress

2. Open Source Accelerates

At $50-100M training cost, open-sourcing a frontier model means:

  • Giving away $50-100M of R&D investment
  • Difficult to justify to investors/shareholders
  • Only philanthropic orgs (Meta) or strategic players (Google) can afford it

At $5-10M training cost, open-sourcing becomes more viable:

  • Smaller sunk cost to give away
  • PR/recruiting/ecosystem benefits may justify cost
  • More orgs can afford to open-source

Impact: Expect more open frontier models in 2026-2027

3. Specialized Models Proliferate

At $50-100M per model, organizations train general-purpose models for maximum ROI:

  • Can’t afford separate models for code, science, medical, etc.
  • One model must serve all use cases
  • Fine-tuning on top of general model is the standard approach

At $5-10M per model, specialized models become economical:

  • Train separate models optimized for specific domains
  • Code model with 90% code data, science model with 80% papers, etc.
  • Higher performance for domain-specific tasks

Impact: Model diversity increases, better domain performance

4. Vertical Integration Incentives

Currently, most companies use API access to models (OpenAI, Anthropic):

  • Training is too expensive to justify
  • Even fine-tuning is complex/costly
  • Easier to pay API fees

At $5-10M training cost, mid-large companies may train custom models:

  • $5M one-time cost vs $5M/year in API fees (if high usage)
  • Full control over model behavior and data privacy
  • Customization for specific business needs

Impact: Disintermediation of API providers, shift to self-hosting

The Democratization Thesis

The broader implication is democratization of frontier AI:

Old regime (2020-2024):

  • Frontier AI: Limited to 5-10 organizations globally (OpenAI, Google, Anthropic, Meta, DeepSeek, etc.)
  • Barrier: $50-100M training cost + rare expertise
  • Result: Concentrated power, API gatekeepers, high prices

New regime (2025+):

  • Frontier AI: Accessible to 50-100 organizations globally
  • Barrier: $5-10M training cost + increasingly common expertise
  • Result: Distributed innovation, open-source competition, low prices

This is analogous to other technology democratizations:

  • Supercomputers (1990s): $10M+ → Cloud computing (2010s): $1K/month
  • Genome sequencing (2001): $100M → Today: $1K
  • Satellite launch (1990s): $100M+ for a dedicated rocket → SpaceX rideshare (2020s): ~$1M for a small payload

The pattern: 10-100x cost reduction enables orders of magnitude more participants.

Caveats and Concerns

Let me inject some realism:

1. DeepSeek’s $5.6M May Be Understated

The reported number likely doesn’t include:

  • Prior failed experiments and iterations
  • Infrastructure setup (one-time costs amortized across multiple models)
  • Full personnel costs (may only count direct training team, not supporting engineers)
  • Data acquisition and cleaning (major cost for some organizations)

True total cost: Likely $10-20M when fully accounting for everything

Still, this is 5-10x cheaper than GPT-4, so the conclusion holds.

2. Reproducibility Unknown

Just because DeepSeek trained a model for $5.6M doesn’t mean others can:

  • They may have proprietary optimizations not in the paper
  • Their data pipeline and infrastructure setup took years to develop
  • First attempts by others might cost 2-3x more

Realistic cost for others: Probably $10-15M for first successful attempt

3. Quality-Cost Tradeoffs

DeepSeek made architectural choices (MoE, FP8, sparse attention) that reduce cost but may sacrifice some quality:

  • SimpleQA score suggests factual knowledge gaps
  • Long-context performance unclear
  • Edge case behavior unknown

It’s possible GPT-4’s higher training cost ($50-100M) buys meaningfully better quality through:

  • Higher precision training (BF16 vs FP8)
  • More extensive RLHF (not included in base training cost)
  • Better data curation (expensive but high-impact)

4. China-Specific Advantages

Some of DeepSeek’s cost advantages may be China-specific:

  • Lower GPU prices (H800 at $25K vs H100 at $35K)
  • Cheaper power ($0.08/kWh vs $0.15/kWh in US)
  • Lower personnel costs (1/3 of US salaries)
  • Government support/subsidies (possible but unconfirmed)

US cost to replicate: Might be $10-15M due to higher input costs

Looking Forward: The $1M Model

If DeepSeek achieved $5.6M through architectural innovation, where does this trajectory lead?

Efficiency improvements on the horizon:

  1. INT4 training: 2x better than FP8 (active research area)
  2. More efficient MoE: Fewer experts, better routing (ongoing work)
  3. Improved sparse attention: 90% reduction vs today’s 70%
  4. Better training recipes: Faster convergence, less compute

Optimistically, these could provide another 5-10x efficiency gain by 2026-2027.

Prediction: By 2027, we’ll see GPT-4-competitive models trained for $1 million or less.

At that price point:

  • Hundreds of organizations can train frontier models
  • Universities can afford it with normal research grants
  • Small startups can compete with tech giants
  • True proliferation of specialized, custom models

This is the future DeepSeek V3.2 is pointing toward: frontier AI as a commodity, not a rare capability controlled by a few giants.

Conclusion

The $5.6 million training cost for DeepSeek V3.2 is the most important number in AI this year.

It’s not just about one model being cheaper to train. It’s about proving that frontier AI doesn’t have to cost $50-100M. The combination of architectural innovation (MoE, sparse attention, FP8), engineering excellence (data pipelines, training recipes), and strategic constraints (H800 hardware, cost pressure) produced a breakthrough in cost efficiency.

This changes the economics of AI from a game dominated by the hyper-wealthy (OpenAI with Microsoft backing, Google, Meta) to one where the merely wealthy (well-funded startups, universities, mid-size tech companies) can compete.

The 2020s started with AI as the domain of giants. The 2030s may see frontier AI as a commodity, with hundreds of organizations training custom models for specific applications. DeepSeek V3.2’s $5.6M cost is the first clear sign of this transition.

For researchers, startups, and organizations that felt locked out of frontier AI development: the door is now open. It’s expensive, but no longer impossible. The democratization of AI is accelerating, and DeepSeek just hit the gas pedal.


David Kim, PhD - AI Economics Researcher, UC Berkeley Center for Human-Compatible AI

The 2,048 H800 GPU cluster that trained DeepSeek V3.2 is a masterclass in infrastructure efficiency. Let me break down what makes MoE training different from dense model training and how DeepSeek optimized for it.

Standard Dense Model Training

When training dense models like Llama 3.1 405B, you need:

Tensor Parallelism: Split model layers across multiple GPUs

  • Every forward pass requires all GPUs to communicate
  • High bandwidth interconnect essential (NVLink, InfiniBand)
  • Bottleneck: Inter-GPU communication bandwidth

Data Parallelism: Different batches on different GPU sets

  • Less communication (only gradient synchronization)
  • Easier to scale
  • Bottleneck: Gradient aggregation at each step

For dense 400B+ models, you’re typically looking at:

  • 8-16 GPUs per model replica (tensor parallelism)
  • 100+ replicas for data parallelism
  • All-reduce communication for gradient sync
  • Very sensitive to interconnect bandwidth

MoE Training: Different Bottlenecks

DeepSeek’s 671B parameter MoE with 256 experts changes the bottleneck profile:

Expert Parallelism: Each expert lives on specific GPUs

  • Different experts activated for different tokens
  • Less all-to-all communication than dense models
  • Can tolerate slightly lower interconnect bandwidth

Dynamic Routing: Tokens routed to different expert GPUs based on router decision

  • Need fast routing logic (low latency)
  • Can batch tokens going to same expert
  • Load balancing critical for efficiency

The beautiful thing about MoE for H800 GPUs:

  • H800 has reduced inter-GPU bandwidth vs H100 (due to export restrictions)
  • But MoE training is less bandwidth-sensitive than dense training
  • DeepSeek’s architecture naturally fits H800’s limitation

DeepSeek’s Likely Cluster Configuration

Based on the 2,788,000 GPU hours and ~2 month training timeline, here’s my best guess at their setup:

Cluster Layout:

  • 2,048 H800 GPUs total
  • Organized as 256 nodes × 8 GPUs per node
  • Each node handles ~3-5 experts (256 experts ÷ 64-85 nodes)
  • Remaining nodes for data parallelism and pipeline parallelism

Network Topology:

  • NVLink within each 8-GPU node (high bandwidth, low latency)
  • InfiniBand between nodes (100-200 Gbps per node)
  • Optimized routing to minimize cross-node expert access

Storage:

  • 10+ PB distributed storage for training data
  • NVMe local storage on each node for caching (1-2 TB per node)
  • Parallel data loading to prevent I/O bottleneck

Achieving 85-95% GPU Utilization

David mentioned this, but let me explain the engineering details. GPU utilization is critical because:

At 60% utilization: $5.6M effective cost becomes $9.3M (40% waste)
At 90% utilization: $5.6M effective cost becomes $6.2M (10% waste)

DeepSeek likely used these techniques:

1. Gradient Accumulation with MicroBatching

Instead of:

Forward → Backward → Update Weights

They do:

Forward (micro-batch 1) → Backward
Forward (micro-batch 2) → Backward
...
Forward (micro-batch N) → Backward
Update Weights (accumulated gradients)

Benefit: Can use smaller micro-batches that fit in GPU memory, accumulate to large effective batch size

  • Keeps GPUs busy with computation
  • Reduces communication frequency
  • Better throughput without quality degradation
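A minimal, framework-free sketch of the accumulation loop above, on a toy linear model (purely illustrative, not DeepSeek's training code): micro-batch gradients are averaged, and the weight update fires once per accumulation window.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0])   # target weights for y = w . x
w = np.zeros(4)
lr, accum_steps = 0.1, 4
grad_accum = np.zeros_like(w)

for step in range(80):                      # 80 micro-batches -> 20 weight updates
    x = rng.normal(size=4)
    y = true_w @ x
    grad = 2 * (w @ x - y) * x              # squared-loss gradient, one micro-batch
    grad_accum += grad / accum_steps        # average over the accumulation window
    if (step + 1) % accum_steps == 0:
        w -= lr * grad_accum                # single "large batch" weight update
        grad_accum[:] = 0.0

print(np.round(w, 2))                       # approaches true_w
```

The effective batch size is `accum_steps` times the micro-batch size, but peak memory only ever holds one micro-batch, which is exactly the tradeoff that keeps GPUs busy.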

2. Asynchronous Expert Loading

MoE models don’t need all experts in GPU memory simultaneously:

  • Load frequently-used experts permanently
  • Stream less-frequent experts from CPU memory
  • Overlap expert loading with computation

This is risky (can hurt utilization if done wrong) but powerful if:

  • Expert usage is predictable
  • Loading latency is hidden behind computation
  • Most tokens use a core set of experts

3. Pipeline Parallelism for Layer Processing

Break the 80-100 layer model into stages:

  • Stage 1: Layers 1-20 on GPU group 1
  • Stage 2: Layers 21-40 on GPU group 2
  • Stage 3: Layers 41-60 on GPU group 3
  • Stage 4: Layers 61-80 on GPU group 4

Process multiple mini-batches in flight:

  • While Stage 1 processes batch N, Stage 2 processes batch N-1
  • Creates pipeline with multiple batches in different stages
  • Reduces bubble time (idle GPUs waiting for previous stage)

GPipe and PipeDream are academic implementations of this. DeepSeek likely uses a custom variant optimized for MoE.
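The bubble overhead has a simple closed form for a GPipe-style schedule: with S stages and M micro-batches in flight, each stage idles for a fraction (S - 1) / (M + S - 1) of the step. A quick check (the S and M values here are illustrative, not DeepSeek's actual configuration):

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of stage time in a GPipe-style pipeline schedule."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(4, 1))              # 0.75 -- one batch, stages mostly idle
print(round(bubble_fraction(4, 32), 3))   # 0.086 -- micro-batches amortize the bubble
```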

4. Overlapping Communication and Computation

Key trick: while GPUs compute forward/backward pass, simultaneously:

  • Transfer gradients to parameter servers
  • Prefetch next batch of data
  • Load next expert weights (if needed)

This requires careful scheduling but can hide 70-90% of communication latency behind computation.

H800 vs H100: Making Lemonade from Lemons

The H800 is the export-restricted version of H100 with:

  • Same compute performance (FP8, FP16, FP32)
  • Reduced interconnect bandwidth (NVLink limited)
  • Same memory capacity and bandwidth

For dense models, this is painful. For DeepSeek’s MoE, they turned it into an advantage:

Optimization 1: Expert Placement Minimizes Communication

Place experts that are frequently co-activated on the same node:

  • Analyze expert co-activation patterns from initial training runs
  • Cluster experts by similarity
  • Place clustered experts on same 8-GPU node
  • 80% of expert-to-expert communication stays within node (NVLink)
  • Only 20% crosses nodes (InfiniBand)

This is possible because of DeepSeek’s auxiliary-loss-free load balancing – expert usage is more predictable without auxiliary loss dynamics.

Optimization 2: Exploit FP8 Tensor Cores Fully

H800 has the same FP8 tensor core performance as H100:

  • 2000 TFLOPS FP8
  • 1000 TFLOPS BF16

The interconnect limitation matters less when you’re compute-bound, not memory-bound. FP8 training shifts the bottleneck from memory/communication to computation, where H800 equals H100.

Effective cost: H800 at $25K performs like H100 at $35K for DeepSeek’s workload – roughly 30% hardware cost savings for the same effective compute

Optimization 3: Strategic Batch Size Tuning

Larger batch sizes require less frequent communication:

  • Small batch (256): Communicate every 256 tokens
  • Large batch (4096): Communicate every 4096 tokens

DeepSeek likely uses batch sizes of 2048-4096, reducing communication frequency by 8-16x compared to typical training (batch size 256-512).

Tradeoff: Large batches can hurt convergence quality. But with proper learning rate scaling and warmup, this is manageable. The cost savings justify the engineering effort.

Power and Cooling: The Hidden Infrastructure Costs

David calculated $236K for power and cooling, which is reasonable. Let me add details:

Power Consumption

Per GPU:

  • H800 TDP: 700W
  • 2,048 GPUs: 1.434 MW

Supporting infrastructure:

  • CPU, memory, storage: +15% → 1.65 MW
  • Power supply efficiency (90%): ÷ 0.9 → 1.83 MW
  • UPS and distribution losses: +5% → 1.92 MW

Total power draw: ~2 MW continuous

Over 57 days:

  • 2 MW × 24 hours × 57 days = 2,736 MWh
  • At $0.08/kWh (China industrial): $219K

David’s $157K was conservative (didn’t include CPU/memory/storage overhead). Actual power cost is probably $220-250K.
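The chain of overhead factors above, as a calculation (every factor is the assumption just stated; the section rounds the draw up to 2 MW, which is where the $219K figure comes from, so this unrounded version lands slightly lower):

```python
gpus, gpu_tdp_kw = 2048, 0.7
power_kw = gpus * gpu_tdp_kw      # 1,433.6 kW of raw GPU load
power_kw *= 1.15                  # +15% CPU, memory, storage
power_kw /= 0.90                  # 90% power supply efficiency
power_kw *= 1.05                  # +5% UPS and distribution losses

energy_mwh = power_kw * 24 * 57 / 1000
cost = energy_mwh * 1000 * 0.08   # $0.08/kWh, assumed China industrial rate
print(f"{power_kw / 1000:.2f} MW draw, {energy_mwh:.0f} MWh, ~${cost / 1000:.0f}K")
```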

Cooling Systems

For 2 MW of heat dissipation:

Air Cooling (most likely for cost efficiency):

  • CRAC units (Computer Room Air Conditioning)
  • Power Usage Effectiveness (PUE): 1.4-1.5
  • Cooling power: 0.4-0.5x compute power = 800-1000 KW
  • Cost: 800 KW × 1,368 hours × $0.08/kWh ≈ $88K

Liquid Cooling (if used for better efficiency):

  • Direct-to-chip liquid cooling
  • PUE: 1.1-1.2
  • Cooling power: 0.1-0.2x compute power = 200-400 KW
  • Cost: 300 KW × 1,368 hours × $0.08/kWh ≈ $33K
  • But: Higher capex ($2M+ for liquid cooling system)

My guess: DeepSeek used air cooling for capex efficiency, accepting slightly higher opex.

Total power + cooling: $219K + $88K ≈ $307K (vs David’s $236K estimate)

Data Pipeline: The Often-Neglected Bottleneck

Training efficiency isn’t just about GPUs – it’s about keeping them fed with data.

Data Loading Architecture

For 2,048 GPUs processing tokens:

  • Assume 4K token batch per GPU
  • 2,048 × 4K = 8.4M tokens per batch
  • At ~2 bytes per token: 16.8 MB per batch
  • Training throughput: ~10-15 batches per second
  • Data throughput required: ~250 MB/s

This doesn’t sound like much, but:

  • Data is distributed across storage cluster
  • Need to shuffle, tokenize, pack sequences
  • Can’t cache everything (dataset is TBs)
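The throughput requirement is small enough to verify in a couple of lines (the batch shape and step rate are the assumptions above):

```python
gpus, tokens_per_gpu = 2048, 4096   # assumed per-GPU batch of 4K tokens
bytes_per_token = 2                 # compact token IDs
batches_per_sec = 12                # assumed, from the 10-15 range above

batch_mb = gpus * tokens_per_gpu * bytes_per_token / 1e6   # ~16.8 MB
throughput_mb_s = batch_mb * batches_per_sec
print(f"{batch_mb:.1f} MB/batch, {throughput_mb_s:.0f} MB/s required")
```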

DeepSeek’s Likely Data Pipeline

Stage 1: Distributed Storage

  • 10-20 storage nodes with NVMe SSDs
  • Redundant storage (RAID, replication)
  • Parallel reads from multiple nodes

Stage 2: Data Loading Workers

  • Dedicated CPU nodes for data preprocessing
  • 100-200 CPU cores just for data loading
  • Tokenization, filtering, shuffling
  • Pre-pack sequences to target length

Stage 3: Memory Caching

  • Each GPU node has 1-2 TB NVMe cache
  • Cache hot data (frequently accessed sequences)
  • 20-30% cache hit rate typical

Stage 4: Prefetching

  • Load next batch while current batch is processing
  • Overlap data loading with GPU computation
  • Hide I/O latency

Cost of data infrastructure:

  • Storage cluster: $500K (upfront capex, amortized)
  • Data loading CPUs: $200K
  • Networking for data: $100K
  • Total: ~$800K capex, ~$50K opex over 2 months

Monitoring and Reliability: Keeping 2,048 GPUs Running

With 2,048 GPUs running for 57 days straight, you will have failures:

Expected Failure Rate

  • GPU failure rate: ~0.1% per 1,000 hours
  • 2,048 GPUs × 1,361 hours × (0.1% per 1,000 hours) ≈ 2.8 GPU failures expected
  • Node failures, network issues, power blips: add 20%
  • Total expected disruptions: 3-5 over 57 days
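Modeling failures as a Poisson process with the rate assumed above makes the same point more sharply: a failure-free 57-day run is unlikely, so recovery has to be automated.

```python
import math

gpus, hours = 2048, 1361
rate_per_gpu_hour = 0.001 / 1000        # 0.1% per 1,000 device-hours
lam = gpus * hours * rate_per_gpu_hour  # expected failures over the run (~2.8)

p_at_least_one = 1 - math.exp(-lam)     # Poisson: P(no failures) = e^-lambda
print(f"expected failures: {lam:.1f}, P(at least one): {p_at_least_one:.0%}")
```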

Checkpointing Strategy

DeepSeek needs aggressive checkpointing:

  • Frequency: Every 30-60 minutes
  • Checkpoint size: ~1.5 TB for weights alone (671B parameters × 2 bytes); full optimizer state adds several TB more
  • Checkpoint time: 30-60 seconds to write to distributed storage
  • Storage: 10-20 checkpoints retained → 15-30 TB storage

Checkpoint overhead:

  • 60 seconds every 60 minutes = 1.6% time overhead
  • Storage: 30 TB × $0.05/GB/month = $1,500/month
  • Negligible cost, essential for reliability

Automated Recovery

When a GPU fails:

  1. Detect failure (health monitoring)
  2. Mark GPU/node as unhealthy
  3. Reroute workload to spare capacity (5-10% spare nodes)
  4. Continue training without human intervention
  5. Replace failed hardware in next maintenance window

Result: <1% training time lost to failures

Networking Infrastructure: The Connective Tissue

Within-Node: NVLink

Each 8-GPU node uses NVLink:

  • 600 GB/s bidirectional bandwidth per GPU
  • Low latency (<1 microsecond)
  • Perfect for tensor parallelism within node

Cost: Included in GPU purchase, no extra cost

Between-Node: InfiniBand

256 nodes connected via InfiniBand:

  • 200 Gbps per node (likely HDR InfiniBand)
  • Switched fabric (3-4 spine-leaf layers)
  • Aggregate injection bandwidth: ~50 Tbps (256 nodes × 200 Gbps)

Cost:

  • InfiniBand NICs: $2K × 256 = $512K
  • InfiniBand switches: $1M for 256-node fabric
  • Cables: $200K
  • Total: $1.7M capex

Amortized over 3 years of cluster use: ~$95K per 2-month training run ($1.7M ÷ 18 runs)

Storage Network: Separate from Compute

Best practice: separate network for storage traffic

  • 100 Gbps Ethernet for storage
  • Prevents storage I/O from interfering with training communication

Cost: $200K additional (switches, NICs, cables)

Total Infrastructure Cost Breakdown (Revised)

Let me revise David’s estimates with infrastructure details:

Compute (owned, not rented):

  • 2,048 × H800 GPUs: $51.2M capex
  • Amortized over 3 years: $1.42M/month × 2 months = $2.84M

Power & Cooling:

  • Power: $220K
  • Cooling: $84K
  • Total: $304K

Networking:

  • InfiniBand: $95K (amortized)
  • Storage network: $15K (amortized)
  • Total: $110K

Storage:

  • Distributed training data storage: $30K (amortized)
  • Checkpoint storage: $2K
  • Total: $32K

Data pipeline:

  • Data loading infrastructure: $40K (amortized)

Monitoring & Tools:

  • Prometheus, Grafana, custom tools: $10K

Personnel (2 months):

  • 15 ML researchers: $15K/month × 2 × 15 = $450K
  • 5 infrastructure engineers: $12K/month × 2 × 5 = $120K
  • Total: $570K

Overhead & Misc:

  • Failed runs, experimentation: $300K
  • Facilities, security, admin: $100K
  • Total: $400K

Grand Total: $2.84M + $307K + $110K + $32K + $40K + $10K + $570K + $400K ≈ $4.31M

This aligns very closely with David’s estimate and DeepSeek’s reported $5.6M (difference likely in data costs, failed experiments, and contingency).

Lessons for Other Organizations

If you’re planning to train your own frontier model, here are key takeaways:

1. Own Your GPUs, Don’t Rent

Cloud GPU costs ($3-4/hour) are 2-3x owned GPU costs when amortized:

  • 2,788,000 hours × $3.50/hour = $9.7M (cloud)
  • vs $2.84M (owned, amortized)
  • Savings: $6.9M (3.4x cheaper to own)

Breakeven: If you’ll use GPUs >40% of the time over 3 years, buying is cheaper.
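A hedged version of that breakeven, on hardware cost alone (power, cooling, and operations staff are excluded here, which is roughly what pushes the all-in breakeven toward the 40% quoted above):

```python
def breakeven_utilization(gpu_price=25_000, cloud_rate=3.50, years=3):
    """Utilization above which owning beats renting, hardware cost only."""
    owned_per_lifetime_hour = gpu_price / (years * 365 * 24)
    return owned_per_lifetime_hour / cloud_rate

print(f"{breakeven_utilization():.0%}")  # ~27% on hardware alone
```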

2. Optimize for Your Workload

DeepSeek optimized every aspect for MoE + FP8 + H800:

  • Expert placement minimizes communication
  • Batch sizes tuned for interconnect bandwidth
  • Pipeline parallelism hides latency

Don’t use generic training scripts – invest in workload-specific optimization.

3. Data Pipeline Is Critical

10-20% of DeepSeek’s efficiency likely comes from data pipeline optimization:

  • Parallel loading
  • Smart caching
  • Prefetching

Budget $500K-1M for data infrastructure, not just GPUs.

4. Build for Failure

2,048 GPUs will fail. Plan for it:

  • 5-10% spare capacity
  • Aggressive checkpointing
  • Automated recovery
  • Don’t let a $20K GPU failure waste $100K of training time

5. Power and Cooling Matter

$300K for power and cooling is 5-7% of total training cost:

  • Negotiate industrial power rates
  • Optimize cooling (liquid if high volume)
  • Consider datacenter location (cheaper power regions)

Conclusion

DeepSeek V3.2’s $5.6M training cost is achievable but required:

  • Owning GPUs (not cloud rental)
  • Architectural optimization for H800 limitations
  • 85-95% GPU utilization through careful engineering
  • Efficient data pipeline
  • Expert placement minimizing communication
  • Aggressive batching and parallelism strategies

The cost isn’t just hardware – it’s expert infrastructure engineering. But it’s replicable. Other organizations with:

  • $50-60M for GPU cluster capex
  • Strong infrastructure engineering team
  • 6-12 months for setup and optimization

can achieve similar costs for future training runs.

The democratization David described is real, but it’s not quite “download the code and train for $5.6M”. It’s “invest $50M in infrastructure, hire experts, spend 6 months optimizing, then train models for $5-10M each”.

Still, that’s 10x better than 2024’s reality. Progress.


Rachel Martinez, Infrastructure Engineering Lead, formerly managed GPU clusters for large-scale model training

Let me start with brutal honesty about what “accessible frontier AI training” means for independent labs like mine:

My lab’s situation:

  • Team: 3 full-time researchers (including me), 2 part-time
  • Funding: $850K/year (mix of grants, consulting, one angel investor)
  • Previous projects: Fine-tuned Llama models, trained smaller models from scratch (~7B parameters)
  • Infrastructure: Access to university cluster (limited), some cloud credits

When I first saw DeepSeek’s $5.6M number, my immediate thought was: “Still 6-7x our entire annual budget.” But then I started thinking about what this actually enables.

The Math for Independent Labs

Let’s work through different scenarios:

Scenario 1: Cloud-Based Training (Most Accessible)

Option A: Full Cloud Training

  • 2,788,000 H100 hours on AWS/GCP/Azure
  • Cost: ~$8-10/hour per H100
  • Total: $22-28 million
  • Verdict: Completely impossible for independent labs

Option B: Spot Instances

  • Same compute on spot instances (70% discount when available)
  • Cost: ~$2.50/hour per H100
  • Total: ~$7 million
  • Verdict: Still impossible, but getting closer

Option C: Preemptible Training with Checkpointing

  • Use spot/preemptible instances aggressively
  • Accept interruptions, resume from checkpoints
  • 80% spot availability over 3-4 months
  • Effective cost: ~$5-6 million
  • Verdict: Technically feasible, financially still out of reach
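
Option C's economics can be modeled roughly as follows. The $2.50/hr spot rate and 80% availability are the figures above; the 5% restart overhead is an assumption. Note that at $2.50/hr this simple model lands nearer $7M than $5-6M, so the lower estimate presumably assumes deeper preemptible discounts during off-peak windows:

```python
# Back-of-envelope model for spot training with interruptions. Availability
# stretches wall-clock time; preemptions waste some GPU-hours on
# checkpoint/restart (the 5% overhead figure is an assumption).

GPU_HOURS = 2_788_000    # total compute needed
SPOT_RATE = 2.50         # $/H100-hour on spot
AVAILABILITY = 0.80      # fraction of time spot capacity is obtainable
RESTART_OVERHEAD = 0.05  # assumed extra GPU-hours lost to preemptions

def spot_cost() -> float:
    billed_hours = GPU_HOURS * (1 + RESTART_OVERHEAD)
    return billed_hours * SPOT_RATE

def wall_clock_months(cluster_size: int = 2_000) -> float:
    effective_gpus = cluster_size * AVAILABILITY
    hours = GPU_HOURS * (1 + RESTART_OVERHEAD) / effective_gpus
    return hours / (24 * 30)

print(f"Effective cost: ${spot_cost() / 1e6:.1f}M")
print(f"Wall-clock time: {wall_clock_months():.1f} months (plus downtime)")
```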

Scenario 2: University Partnership

Many independent labs (including mine) have university affiliations:

Cost Structure:

  • University owns GPUs (compute cluster)
  • Grants cover power, storage, personnel
  • No capital expenditure required

Challenge: Getting 2,000+ GPU allocation

  • Most university clusters: 100-500 GPUs total
  • Heavy competition for resources
  • Allocation limits: typically 10-50 GPUs max per project

Possible approach:

  • Multi-university collaboration (5-10 universities pooling resources)
  • Each contributes 200-400 GPUs
  • Coordinate training across distributed sites

Cost:

  • Power and cooling: $300-400K (universities often subsidize)
  • Personnel: $500-600K (mix of professors, postdocs, students)
  • Coordination overhead: $100-200K
  • Total: $900K - $1.2M

Verdict: Financially feasible, logistically challenging

Scenario 3: Crowdfunding / Community Model

The crypto/web3 community has shown that crowdfunding can raise millions for ambitious tech projects:

Funding Strategy:

  • Crowdfunding campaign: Target $3M
  • Angel investors / VCs: $2M
  • Grants (NSF, OpenAI, etc.): $1M
  • Total budget: $6M

Use of funds:

  • GPU rental (spot instances, 3-4 months): $4M
  • Personnel (6 months for team of 10): $800K
  • Infrastructure, data, tools: $500K
  • Contingency: $700K

ROI for funders:

  • Open-source model release (MIT license)
  • Research papers (academic credit)
  • Commercial fine-tuning services (revenue share)
  • Community recognition and influence

Verdict: Plausible for charismatic researchers with strong community ties

What Makes This Different From GPT-3 Era

When GPT-3 came out (2020), independent research on that scale was unthinkable:

GPT-3 training (estimated):

  • Cost: $4-5 million (actually comparable to DeepSeek!)
  • But: Required V100 cluster that only major labs had
  • Infrastructure: Not available to independents at any price
  • Expertise: Cutting-edge, closely guarded

DeepSeek V3.2 (2025):

  • Cost: $5.6 million (similar)
  • But: Can be done on cloud instances (available to anyone with money)
  • Infrastructure: Public cloud providers offer necessary hardware
  • Expertise: Architecture details published, code open-sourced

The key difference isn’t cost – it’s accessibility of resources and knowledge.

The “Indie AI Lab” Playbook

If I were to seriously attempt a DeepSeek-scale training run, here’s how I’d approach it:

Phase 1: Fundraising (6 months)

Target: $5-7M total

Sources:

  1. Kickstarter / Gitcoin ($500K-1M)

    • Pitch: “Community-owned frontier AI model”
    • Rewards: API credits, early access, contributor recognition
  2. Angel Investors ($1-2M)

    • Pitch: “10x cheaper frontier models enable new business models”
    • Return: Equity in commercial services built on model
  3. Grants ($500K-1M)

    • NSF, NIH for research applications
    • OpenAI Researcher Access Program
    • European research funding (Horizon Europe)
  4. Corporate Sponsors ($1-2M)

    • Companies that would benefit from open model
    • Data partnerships (contribute training data)
    • Compute partnerships (cloud credits)
  5. DAO / Crypto Funding ($1-2M)

    • Pitch to decentralized science (DeSci) DAOs
    • Token-based funding model
    • Community governance of model

Feasibility: Hard but possible. Similar amounts have been raised for ambitious open-source projects.

Phase 2: Team Building (3 months, overlaps with fundraising)

Core team (10 people):

  • 1 ML lead (world-class MoE expertise): $200K/year
  • 3 ML researchers (training, architecture): $150K/year each
  • 2 Infrastructure engineers (distributed training): $160K/year each
  • 1 Data engineer (pipeline, cleaning): $140K/year
  • 1 Project manager: $120K/year
  • 2 Research engineers (evaluation, analysis): $130K/year each

Total personnel cost (6 months): ~$745K (half of the ~$1.49M annual run rate above)

Recruiting strategy:

  • Mix of senior hires (1-2) and junior talent (6-7)
  • Remote-first (access global talent, lower costs)
  • Equity/token grants to reduce cash burn
  • Mission-driven (attract people who care about open AI)

Phase 3: Infrastructure Setup (2 months)

Cloud strategy:

  • Primary: AWS spot instances (cheapest H100 access)
  • Backup: GCP for redundancy
  • Strategy: Run on spot instances, checkpoint aggressively, resume on interruption

Setup:

  • Distributed training framework (Megatron-DeepSpeed + custom MoE code)
  • Monitoring (Prometheus, Grafana, custom dashboards)
  • Data pipeline (based on DeepSeek’s open-source code)
  • Checkpoint management (S3 + automation)

Cost: $100-200K for setup, tools, testing

Phase 4: Data Acquisition (4 months, overlaps with setup)

This is underrated – quality training data is hard:

Data sources:

  • Common Crawl: 50% (free but needs heavy filtering)
  • GitHub: 10% (code, free)
  • Academic papers: 10% (ArXiv, open access)
  • Books: 10% (public domain, licensed)
  • Curated datasets: 20% (partnerships, licensing)

Data processing:

  • Deduplication, filtering, quality scoring
  • Tokenization, packing
  • 100+ TB raw → 10TB processed

Cost: $200-400K (mostly licensing and compute for processing)
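
As a toy illustration of the processing steps above: exact deduplication by content hash plus a crude quality heuristic. Production pipelines use fuzzy dedup (MinHash/LSH) and learned quality scorers; the thresholds here are invented:

```python
# Toy dedup-and-filter pass: drop exact duplicates via SHA-256 hashing,
# then apply simple quality heuristics (length, digit density).

import hashlib

def dedup_and_filter(docs, min_words: int = 5, max_digit_frac: float = 0.3):
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue                      # exact duplicate
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue                      # too short to be useful
        digits = sum(c.isdigit() for c in doc)
        if digits / max(len(doc), 1) > max_digit_frac:
            continue                      # likely tables/boilerplate
        kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate
    "too short",                                     # filtered: length
    "123 456 789 000 111 222 333 444",               # filtered: digit-heavy
]
print(len(dedup_and_filter(corpus)))  # 1
```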

Phase 5: Training (3-4 months)

Compute allocation:

  • 2,000 H100 spot instances (when available)
  • Fall back to 1,500-1,800 if spots unavailable
  • Accept longer training time (4 months vs 2 months)

Cost: $4-5M (depends on spot instance availability)

Approach:

  • Start with smaller model (300B) to validate pipeline
  • Scale to full 671B once confident
  • Aggressive experimentation early, conservative later

Phase 6: Evaluation & Release (1 month)

Testing:

  • Standard benchmarks (MMLU, HumanEval, MATH, etc.)
  • Custom evaluation on target applications
  • Red teaming for safety issues

Release:

  • Model weights: MIT License on Hugging Face
  • Training code: GitHub
  • Documentation: Comprehensive guides
  • Paper: ArXiv, submit to NeurIPS/ICML

Cost: $100K (compute for evaluation, documentation)

Total Budget: $5.5-7M over 12 months

This is… actually feasible. Difficult, risky, but not impossible.

The Real Barriers

The budget is daunting but not the biggest challenge. The real barriers are:

1. Coordination and Execution Risk

Training a 671B parameter model is hard:

  • Requires deep expertise (few people have done it)
  • High technical risk (many failure modes)
  • Tight coordination needed (distributed team of 10)

Mitigation:

  • Hire 1-2 people who’ve done large-scale training before
  • Start with smaller model to learn
  • Budget for failures (hence the $7M, not $5.6M)

2. Fundraising Difficulty

Raising $5-7M for a research project is no joke:

  • Requires proven track record
  • Needs compelling story
  • Competitive with many other projects seeking funding

Mitigation:

  • Focus on clear value proposition (open alternative to GPT-4)
  • Partner with established organizations (universities, nonprofits)
  • Phased funding (raise $2M, achieve milestone, raise more)

3. Talent Competition

The 10 people needed are in high demand:

  • Can earn 2-3x at major tech companies
  • Competing with OpenAI, Google, Meta for talent
  • Independent labs are risky (funding could dry up)

Mitigation:

  • Mission-driven appeal (work on open AI for public good)
  • Equity/ownership (share in success)
  • Flexibility (remote, interesting problems, autonomy)
  • Career benefit (lead role on major project, publishable research)

Why This Matters for Independent Research

The $5.6M cost threshold is psychologically important:

$50M: Only major corporations or heavily-funded startups
$5M: Ambitious, but achievable by coordinated independent effort

This is similar to other democratization moments:

Genome sequencing: $100M (2001) → $1M (2010) → $1K (2020)

  • $100M: Only government-funded projects
  • $1M: Well-funded academic labs could do it
  • $1K: Any researcher can sequence

Satellite launches: $100M+ (1990s) → single-digit millions (2020s with Rocket Lab and rideshares)

  • $100M+: Only governments and major corporations
  • Single-digit millions: Universities, startups, research projects

We’re now at the “$1M genome sequencing” era for frontier AI. Still expensive, but within reach of coordinated independent efforts.

What I’m Actually Doing

Full disclosure: I’m not immediately training a $5.6M model. But here’s my actual plan:

Near-term (2026): Training Smaller, Specialized Models

Budget: $200-300K (within my current funding)

  • 30B parameter MoE model (8 experts, 8B active)
  • Specialized for scientific research (my domain)
  • Train on 1-2 trillion tokens of scientific data
  • 4-6 week training on cloud spot instances

Goal: Learn the MoE training pipeline, prove capability, build reputation

Medium-term (2027): Scale to ~100-200B Parameters

Budget: $800K - $1.5M (need fundraising)

  • Similar architecture to DeepSeek, smaller scale
  • Domain-specialized (science + code)
  • Partner with universities for compute
  • Target: Matches Llama 3.1 70B quality at 1/3 inference cost

Goal: Demonstrate frontier-adjacent capability, attract more funding

Long-term (2028+): Frontier Model

Budget: $5-8M (serious fundraising required)

  • Full 671B parameter model or similar
  • General purpose or specialized vertical
  • Multi-institution collaboration
  • Goal: True GPT-4 alternative, open-source

Path: Prove myself with smaller models first, build credibility, attract funding

Advice for Other Independent Researchers

If you’re considering ambitious model training:

Start Small

  • Don’t jump straight to $5M training
  • Train 7B → 30B → 100B → 671B
  • Learn the pipeline at smaller scale first
  • Build credibility for fundraising

Find Partners

  • Universities (compute resources)
  • Companies (data, use cases, funding)
  • Other researchers (split costs and work)
  • Foundations (grant funding)

Pick Your Battles

  • Don’t train general-purpose GPT-4 clone (OpenAI already did it)
  • Find underserved niches (science, medicine, low-resource languages)
  • Differentiate on data or architecture, not just scale

Build in Public

  • Share progress, learnings, code
  • Build community and support
  • Makes fundraising easier (proof of capability)
  • Attracts collaborators

Conclusion

DeepSeek’s $5.6M training cost doesn’t make frontier AI training easy for independent researchers. But it makes it conceivable.

The barriers are:

  • Still expensive ($5-7M is a lot)
  • Requires serious fundraising
  • Needs technical expertise
  • High execution risk

But for the first time, these barriers aren’t insurmountable. A coordinated effort by motivated independent researchers, with the right funding and partnerships, could realistically train a frontier model.

This is huge. It means:

  • Academic research can return to frontier AI
  • Independent perspectives can challenge big tech
  • Innovation isn’t limited to those with $50M+ budgets
  • Open-source can keep pace with proprietary models

I’m not sure when my lab will train a $5M model. But knowing it’s possible – that’s motivating. That’s the future DeepSeek just opened up.

Maybe in 3-5 years, we’ll see dozens of independent labs training frontier models. That would be a very different AI landscape than today’s duopoly.

Here’s to the indie AI researchers. Our time might be coming.


Tom Anderson, Founder of Independent AI Research Lab, formerly ML researcher at academic institution

Before DeepSeek V3.2, the AI startup investment landscape looked like this:

Tier 1: Foundation Model Companies

Examples: OpenAI, Anthropic, Inflection, Cohere

  • Capital requirement: $200M+ to reach frontier performance
  • Funding rounds: Mega-rounds ($1B+) from specialized investors
  • Investor profile: Only large funds, strategic corporates (Microsoft, Google)
  • Our fund: Too small to participate meaningfully

Tier 2: Application Layer Companies

Examples: Jasper, Copy.ai, Harvey, Glean

  • Capital requirement: $20-100M to build product and acquire customers
  • Funding rounds: Series A-C ($10-50M)
  • Investor profile: Traditional VCs (this is us)
  • Value capture: Limited by API costs from Tier 1 companies

Tier 3: Tools and Infrastructure

Examples: LangChain, Weights & Biases, Modal

  • Capital requirement: $10-50M
  • Investor profile: Traditional VCs
  • Risk: Commoditization as OpenAI/Anthropic build integrated platforms

The problem: Tier 1 captured most value but required capital we didn’t have. Tier 2 and 3 were accessible but faced margin compression from Tier 1’s pricing power.

Post-DeepSeek Investment Landscape

Now consider the new world:

New Tier: Efficient Foundation Model Companies

Enabled by: $5-10M training costs, not $50-100M

  • Capital requirement: $25-50M total (Series A/B scale!)
  • Funding rounds: Standard VC rounds, not mega-rounds
  • Investor profile: Traditional VCs can participate
  • Opportunity: Challenge OpenAI/Anthropic with cost-efficient models

This creates a new investable category for VCs like me.

Investment Thesis: The Cost-Efficient Foundation Model

Let me sketch out a hypothetical investment in a DeepSeek-inspired AI startup:

Company Profile: “Efficient AI Labs” (fictional example)

Mission: Train frontier models at 10x lower cost than OpenAI

Team:

  • CEO: Former OpenAI/Google researcher with training experience
  • CTO: Infrastructure expert (trained Llama at Meta)
  • 5-10 ML researchers and engineers

Product Strategy:

  1. Train DeepSeek-style efficient foundation model ($5-8M)
  2. Offer API access at 50% of OpenAI’s pricing
  3. Self-serve platform for fine-tuning and deployment
  4. Target cost-sensitive enterprises and developers

Funding Ask: $40M Series A

Use of Funds

Model Training ($10M):

  • Initial model: $6M
  • Second iteration (based on learnings): $4M

Infrastructure ($8M):

  • GPU cluster (owned): $5M (~160 H100s at ~$30K each)
  • Serving infrastructure: $2M
  • Data pipeline: $1M

Personnel ($15M over 24 months):

  • 20 people × $150K average × 2 years = $6M
  • Recruiting, benefits, overhead: $4M
  • Data labeling, contractors: $5M

Sales & Marketing ($5M):

  • Developer relations
  • Enterprise sales team
  • Marketing campaigns

Contingency & Operations ($2M)

Revenue Model

API Pricing:

  • 50% of OpenAI pricing (e.g., $5 per 1M tokens vs $10)
  • Target: Cost-sensitive high-volume customers
  • Gross margin: 70% (lower inference costs due to efficient architecture)

Revenue Projections (conservative):

Year 1:

  • 100 paying customers
  • Average $2K/month
  • ARR: $2.4M

Year 2:

  • 500 paying customers
  • Average $5K/month (growing usage)
  • ARR: $30M

Year 3:

  • 2,000 customers
  • Average $8K/month
  • ARR: $192M
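
The projection above is straightforward ARR arithmetic (customers × monthly spend × 12), using the article's own assumptions:

```python
# The three-year revenue projection as arithmetic: ARR = customers ×
# average monthly spend × 12. All inputs are the article's assumptions.

projections = [
    (1, 100,   2_000),   # year, paying customers, avg $/month
    (2, 500,   5_000),
    (3, 2_000, 8_000),
]

for year, customers, monthly in projections:
    arr = customers * monthly * 12
    print(f"Year {year}: ARR ${arr / 1e6:.1f}M")
# Year 1: ARR $2.4M
# Year 2: ARR $30.0M
# Year 3: ARR $192.0M
```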

Why This Works Now (But Didn’t Before)

Previous economics:

  • Training cost: $50M
  • Need to raise: $100M+ (includes scaling, sales, etc.)
  • Requires mega-round from top-tier VCs
  • Most VC funds can’t write $20-50M checks

New economics:

  • Training cost: $5-10M
  • Need to raise: $40M total
  • Series A scale ($20M) + Series B ($20M)
  • Accessible to traditional VC funds

Our fund size: $250M

  • Can lead $20M Series A (8% of fund)
  • Reasonable position sizing
  • Can participate in follow-on rounds

Disruption of Incumbent Business Models

From an investor perspective, here’s what’s at risk:

OpenAI (Microsoft Investment: $13B)

Current moat:

  • Best models (GPT-4, GPT-4o)
  • First-mover advantage
  • Ecosystem lock-in
  • Brand recognition

Threatened by DeepSeek-style competitors:

  • Model quality gap narrows (DeepSeek matches GPT-4 on many benchmarks)
  • Pricing pressure (competitors at 50% of OpenAI pricing)
  • Customer churn to cheaper alternatives
  • Enterprise customers demand self-hosting options

Likely OpenAI response:

  • Price cuts (compress margins)
  • Faster model iteration (increase R&D spend)
  • Enterprise features (security, customization)
  • Vertical integration (build applications)

Investment implication: OpenAI’s valuation ($80-100B) assumes sustained pricing power. If competition intensifies, multiple could compress 30-50%.

Anthropic (Funding: $7B+)

Current moat:

  • Claude 3.5 Sonnet quality (arguably best overall model)
  • Safety focus (appeals to enterprises)
  • Constitutional AI differentiator

Threatened by:

  • Open-source models matching quality
  • Can’t compete on price with cost-efficient competitors
  • Safety advantages may commoditize as others adopt similar techniques

Likely response:

  • Double down on safety and reliability (enterprise premium)
  • Specialized models (healthcare, legal, finance)
  • Partner with governments and large enterprises

Investment implication: Anthropic’s path to profitability harder if forced to cut prices 30-50%.

Application Layer Startups (Jasper, Copy.ai, etc.)

Current pain point:

  • API costs are 40-60% of revenue
  • Gross margins only 40-60%
  • Difficult to achieve profitability at scale

DeepSeek changes everything:

  • Can self-host DeepSeek V3.2 for fraction of API costs
  • Or use cheaper APIs from DeepSeek-style competitors
  • Gross margins improve to 70-85%
  • Path to profitability accelerates

Investment implication: Application layer companies become MORE attractive investments (better unit economics).

New Investment Opportunities

DeepSeek creates several new investment themes:

1. Domain-Specialized Efficient Models

Thesis: Train DeepSeek-style models specialized for specific verticals

Examples:

  • Medical AI: Train on medical literature, clinical trials, patient records
  • Legal AI: Train on case law, contracts, regulations
  • Financial AI: Train on financial reports, market data, transactions
  • Code AI: Enhanced coding models (10x GitHub data vs general models)

Why it works:

  • $5-10M training cost makes vertical models economically viable
  • Specialized models outperform general models in specific domains
  • Enterprises pay premium for domain expertise
  • Regulatory compliance easier with domain-specific models (no irrelevant capabilities)

Investment size: $20-40M to build specialized model company

2. Infrastructure for Efficient Training

Thesis: Tools and platforms to help others train DeepSeek-style models

Products:

  • MoE training frameworks (optimized for cost efficiency)
  • Data pipelines for trillion-token scale
  • Hyperparameter optimization for sparse models
  • Managed training platforms (bring your data, we train your model)

Examples (current landscape):

  • Together AI, Mosaic ML (acquired), Modal
  • Gap: No one focused specifically on cost-efficient training

Investment size: $15-30M

3. Model Evaluation and Testing

Thesis: As model proliferation accelerates, need better evaluation tools

Products:

  • Benchmark-as-a-service (evaluate on proprietary benchmarks)
  • Red-teaming services (find model weaknesses)
  • Performance comparison platforms
  • Domain-specific evaluation (medical accuracy, legal compliance)

Why now: With dozens of models (OpenAI, Anthropic, DeepSeek, Llama, etc.), choosing the right model gets harder

Investment size: $10-25M

4. Open Source Hosting and Fine-Tuning

Thesis: Demand for self-hosted DeepSeek alternatives will explode

Products:

  • Managed deployment of DeepSeek V3.2 (like Replicate, Hugging Face)
  • Fine-tuning platforms specialized for MoE models
  • Model optimization (quantization, distillation for deployment)

Addressable market: Every company currently using OpenAI API could be customer

Investment size: $20-40M

Risks to This Investment Thesis

I always consider downside scenarios:

Risk 1: OpenAI/Anthropic Price War

If OpenAI cuts prices 50-70% to maintain market share:

  • Startup competitors face margin compression
  • API cost advantage disappears
  • Self-hosting value proposition weakens

Likelihood: Moderate (30-40%)

  • OpenAI has shown willingness to cut prices (GPT-3.5 is cheap)
  • But they also need to justify $80B+ valuation to investors
  • Price war hurts everyone’s economics

Mitigation: Invest in companies with defensibility beyond price (quality, specialization, enterprise features)

Risk 2: DeepSeek Quality Doesn’t Hold Up

If real-world testing reveals DeepSeek is actually worse than GPT-4:

  • Benchmark scores were misleading (data contamination?)
  • Quality gaps emerge in production use
  • Customers choose GPT-4 despite higher cost

Likelihood: Low-Moderate (20-30%)

  • Some skepticism is warranted (Emily’s points)
  • But architecture is sound, many confirmations emerging
  • Even if 90% of GPT-4 quality, still valuable at 50% of cost

Mitigation: Due diligence includes extensive model testing before investment

Risk 3: Training Costs Rise Again

If next-generation models require $20-50M to train:

  • DeepSeek’s $5.6M was one-time efficiency gain
  • Keeping up with state-of-the-art requires more capital
  • Back to expensive mega-rounds

Likelihood: Moderate (30%)

  • AI capabilities may require more scale
  • Multimodal models (vision, audio) need more data/compute
  • But: Efficiency innovations continue (INT4, better architectures)

Mitigation: Invest in companies with ongoing efficiency R&D, not just training one model

Risk 4: Regulatory Intervention

Governments may regulate AI model training:

  • Licensing requirements for frontier models
  • Restrictions on open-source releases
  • Data privacy regulations increase training costs

Likelihood: Moderate-High (40-50%)

  • EU AI Act already in effect
  • US Congress considering legislation
  • China has model registration requirements

Mitigation: Invest in companies with strong regulatory/legal teams, enterprise focus

Portfolio Construction Strategy

Given these opportunities and risks, here’s how I’m adjusting our portfolio:

Reduce Exposure to:

  • Pure application layer companies without model differentiation
  • Expensive foundation model companies requiring $100M+ rounds
  • OpenAI/Anthropic “wrapper” companies with no defensibility

Increase Exposure to:

  • Domain-specialized model companies ($20-40M raises)
  • Infrastructure for efficient training and deployment
  • Enterprise AI with self-hosting options
  • Open-source focused companies (align with DeepSeek trend)

Target Portfolio Allocation (AI investments):

  • 30%: Application layer with strong unit economics (using DeepSeek-style models)
  • 30%: Domain-specialized efficient foundation models
  • 20%: Infrastructure and tools
  • 15%: Enterprise AI with self-hosting
  • 5%: Evaluation, safety, governance

What I’m Looking For in Founders

If you’re building an AI company and seeking investment, here’s what gets my attention:

Must-Haves:

  1. Deep technical expertise: Led large-scale model training before (OpenAI, Google, Meta, DeepSeek experience)
  2. Cost-consciousness: Architecture and business model optimized for efficiency
  3. Clear differentiation: Not just “GPT-4 but cheaper” – have a specific angle
  4. Enterprise focus: B2B revenue model, not consumer (better margins, retention)

Nice-to-Haves:

  1. Open source experience: Understanding of community-driven development
  2. Domain expertise: If building vertical AI, deep domain knowledge
  3. Previous successful exit: De-risks execution ability
  4. Strong network: Can recruit top talent from major AI labs

Conclusion: A New Golden Age for AI Venture Capital

The $5.6M training cost is the most important development for AI venture capital since the GPT-3 API launch.

It transforms foundation models from mega-fund territory to traditional VC territory. This means:

  • More competition (dozens of well-funded model companies)
  • More innovation (can afford experimentation)
  • Better outcomes for application layer (lower costs, better margins)
  • Healthier ecosystem (less concentration of power)

For investors, this is exciting. The AI investment landscape was becoming bifurcated: mega-funds invest in foundation models, everyone else fights for application layer scraps. Now there’s a vibrant middle ground.

I expect 2026-2027 to see a wave of well-funded ($30-50M) efficient foundation model companies, all inspired by DeepSeek’s architecture. Some will fail, but a few will succeed and build substantial businesses.

The AI market is big enough for multiple winners. OpenAI and Anthropic won’t disappear, but they’ll face real competition from cost-efficient alternatives. That’s healthy.

For VCs willing to develop deep technical diligence capabilities (understanding MoE architectures, training efficiency, benchmark methodology), this is a generational investment opportunity.

I’m actively seeking AI companies to invest in. If you’re building an efficient foundation model or enabling infrastructure, reach out.

The DeepSeek era of AI venture capital is just beginning.


Lisa Zhang, Partner at mid-sized VC fund ($250M AUM), focused on AI/ML investments