Just came from the “AI Infrastructure at Scale” panel at SF Tech Week and the numbers are sobering. Let’s talk real costs.
Panelists: Anthropic's CTO, Stability AI's VP of Engineering, Midjourney's CEO, plus Crusoe Energy (GPU cloud)
The GPU Crisis is Real
Current H100 Market (October 2025):
- H100 spot price: $2.89/hour (down from $4.50 in June)
- H100 reserved (1-year): $1.95/hour
- A100 spot price: $1.10/hour
- Cloud markup: 40-60% vs bare metal
Why prices dropped: major cloud providers finally got supply, and NVIDIA shipped 500K+ H100s in Q3
But here’s the catch: H200s launching in Q1 2026, and everyone will want to upgrade
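To make those hourly rates concrete, here's a quick sketch of what one H100 costs per month at full utilization. The 730 hours/month figure is my assumption (average month); the rates are the ones above.

```python
HOURS_PER_MONTH = 730  # assumed: average hours in a month, 24/7 utilization

def monthly_gpu_cost(hourly_rate: float, gpus: int = 1) -> float:
    """Monthly cost of running `gpus` GPUs around the clock at a flat rate."""
    return hourly_rate * HOURS_PER_MONTH * gpus

h100_spot = monthly_gpu_cost(2.89)      # ~$2,110/month per H100 on spot
h100_reserved = monthly_gpu_cost(1.95)  # ~$1,424/month per H100 reserved
savings = 1 - h100_reserved / h100_spot  # ~33% cheaper on a 1-year commit
```

At today's spot price, a single always-on H100 runs about $25K/year; the 1-year commitment shaves roughly a third off that.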
Training Cost Reality Check
Anthropic shared actual numbers for Claude models:
- Claude 2: $15M-$20M in compute (2023)
- Claude 3: $35M-$45M in compute (2024)
- Next generation: Estimated $80M-$120M
Stability AI’s Stable Diffusion 3:
- Training cost: $6M
- 16,384 A100s for 3 weeks
- Plus data processing, storage, failed experiments
OpenAI GPT-4 (reported):
- Estimated training cost: $100M-$150M
- 25,000 A100s for 3-4 months
These are TRAINING costs only - not infrastructure, salaries, data, R&D.
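A quick sanity check on the GPT-4 figures: taking the reported GPU count and duration at face value, the $100M-$150M estimate implies an effective per-GPU-hour rate. The 730 hours/month is my assumption; the implied rate is arithmetic, not a panel number.

```python
# Reported inputs: 25,000 A100s for ~3.5 months (midpoint of 3-4)
gpus = 25_000
months = 3.5
gpu_hours = gpus * months * 730          # ~63.9M A100-hours

implied_low = 100e6 / gpu_hours          # ~$1.57 per A100-hour
implied_high = 150e6 / gpu_hours         # ~$2.35 per A100-hour
```

An effective $1.57-$2.35 per A100-hour is plausible for 2023-era capacity; at today's $1.10 spot price the same run would pencil out closer to $70M, which is part of why training costs per FLOP keep falling even as total budgets rise.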
Inference Cost Structure
Midjourney CEO shared their economics:
Revenue: $200M ARR
Inference costs: $50M-$60M annually (25-30% of revenue)
GPU fleet: 30,000+ A100s
Cost per image generation:
- High quality: $0.032
- Standard: $0.018
- Price to user: $0.04 (paid tier)
Gross margin on compute: 20-25%
This is AFTER massive optimization. Year 1 they lost money on every image.
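The per-image numbers above let you back out the margin math. The 90/10 quality mix below is my hypothetical assumption, chosen to show how a blended margin can land in the quoted 20-25% band.

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - cost) / price

hq = gross_margin(0.04, 0.032)    # 20% margin on high-quality renders
std = gross_margin(0.04, 0.018)   # 55% margin on standard renders

# Hypothetical mix: if 90% of paid volume is high quality, the blend is:
blended = 0.9 * hq + 0.1 * std    # ~23.5%, inside the quoted 20-25% band
```

The striking implication: a 55% margin product blended with a 20% margin product still nets out near 20% if users overwhelmingly pick the expensive option.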
The Elephant in the Room: Who Can Actually Afford This?
Crusoe Energy’s brutal breakdown:
Minimum viable AI startup infrastructure:
- 64 H100s (small cluster): $12K/month reserved, $18K spot
- Storage for training data: $5K/month
- Networking: $3K/month
- Total: $20K-$26K/month = $240K-$312K/year
And that’s for a SMALL training run. Most serious models need 256+ GPUs.
Series A startup ($4M raise, 18-month runway):
- AI infrastructure: $300K-$500K/year
- That’s 12.5% of your raise per year on GPUs alone
- Before salaries, before data, before anything else
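Worth spelling out what that means over the full runway, not just per year. The inputs are the figures above; the arithmetic is mine.

```python
raise_usd = 4_000_000
runway_months = 18
infra_annual_low, infra_annual_high = 300_000, 500_000

def share_of_raise(annual_spend: float) -> float:
    """Fraction of the raise consumed by this line item over the runway."""
    total_over_runway = annual_spend * runway_months / 12
    return total_over_runway / raise_usd

low = share_of_raise(infra_annual_low)    # ~11% of the raise
high = share_of_raise(infra_annual_high)  # ~19% of the raise
```

Over 18 months, GPU spend alone eats 11-19% of a $4M raise before a single salary is paid.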
Cloud vs On-Premise Decision
Panel consensus:
Under $500K annual GPU spend: Cloud (AWS, GCP, Azure)
- Flexibility, no upfront capex
- But 50% markup over bare metal
$500K-$2M annual: Specialized GPU clouds (Crusoe, Lambda, CoreWeave)
- 30% cheaper than hyperscalers
- Good support, fast provisioning
Over $2M annual: Consider on-premise or colo
- Anthropic example: Own data centers in partnership with Equinix
- Upfront cost: $3M-$5M for a 1,024-GPU H100 cluster
- Break-even: 18-24 months vs cloud
- Only makes sense if you’ll use it for 3+ years
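The break-even logic is simple: capex divided by monthly savings versus cloud. The capex is the panel's range; the monthly cloud-equivalent bill and on-prem opex (power, colo, ops staff) below are hypothetical placeholders, not panel numbers.

```python
def break_even_months(capex: float, cloud_monthly: float,
                      onprem_opex_monthly: float) -> float:
    """Months until on-prem capex is recovered by savings vs cloud."""
    return capex / (cloud_monthly - onprem_opex_monthly)

# Hypothetical: $4M capex, $400K/month equivalent cloud bill,
# $200K/month on-prem power + colo + operations.
months = break_even_months(4_000_000, 400_000, 200_000)  # 20 months
```

Note how sensitive this is to the opex estimate: if running the cluster yourself costs $300K/month instead of $200K, break-even stretches from 20 to 40 months, which is why the panel's "3+ years of use" caveat matters.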
The Hidden Costs Nobody Talks About
Storage:
- Training datasets: 100TB-1PB
- Model checkpoints: 10-50TB per training run
- Logs and telemetry: 5-10TB/month
- Cost: $0.023/GB/month (S3 Standard) = $2.3K-$23K/month for the dataset alone
Networking:
- Inter-GPU bandwidth crucial for distributed training
- InfiniBand clusters: $200K-$500K infrastructure
- Cross-region data transfer: $0.09/GB (adds up FAST)
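The storage and transfer rates above multiply out quickly. The rates are the ones quoted (S3 Standard first tier, typical cross-region transfer); the dataset sizes are the ranges from this section.

```python
S3_STANDARD_PER_GB_MONTH = 0.023   # S3 Standard, first pricing tier
EGRESS_PER_GB = 0.09               # typical cross-region transfer rate

def tb_to_gb(tb: float) -> float:
    return tb * 1_000

dataset_100tb = tb_to_gb(100) * S3_STANDARD_PER_GB_MONTH    # ~$2,300/month
dataset_1pb = tb_to_gb(1_000) * S3_STANDARD_PER_GB_MONTH    # ~$23,000/month
one_copy_100tb = tb_to_gb(100) * EGRESS_PER_GB              # ~$9,000 per copy
```

A single cross-region copy of a 100TB dataset costs as much as four months of storing it, which is why teams pin training data to one region.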
Talent:
- ML infrastructure engineer: $250K-$400K
- You need 2-3 for 24/7 coverage
- Plus ML engineers, data engineers
Failed Experiments:
- Stability AI: “For every successful model, we have 10-15 failed training runs”
- Failed runs still cost money - budget 2-3x your successful training cost
FinOps for AI: Cost Control Strategies
Strategies shared by panelists:
1. Spot instance arbitrage
- Midjourney: 60% inference on spot instances
- Automatic fallback to reserved
- Saves 40% on compute
2. Multi-cloud strategy
- Play AWS vs GCP vs Azure pricing
- “We moved 30% workload to GCP when they offered 25% discount” - Stability AI
3. Model optimization
- Quantization: 8-bit weights use 4x less memory than 32-bit (2x less than FP16)
- Distillation: Smaller models for inference
- Midjourney cut inference costs 60% through optimization
4. Smart batching
- Batch inference requests
- Higher GPU utilization
- 2-3x better cost efficiency
5. Geographic arbitrage
- Oregon (cheap hydro power): $1.80/hour H100
- Northern Virginia (demand): $2.50/hour H100
- 40% cost difference for same hardware
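To illustrate strategy 3, here's the memory math behind quantization. The 70B parameter count is a hypothetical model size I chose for round numbers; the bytes-per-parameter figures are standard for the named formats.

```python
def model_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight-storage footprint in GB (1e9 params x bytes each)."""
    return params_billions * bytes_per_param

# Hypothetical 70B-parameter model, weights only (no activations/KV cache):
fp32 = model_memory_gb(70, 4)   # 280 GB in 32-bit floats
fp16 = model_memory_gb(70, 2)   # 140 GB in 16-bit floats
int8 = model_memory_gb(70, 1)   #  70 GB in 8-bit -> 4x smaller than fp32
```

Fewer bytes per parameter means fewer GPUs per replica, which is where the inference savings actually come from.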
ROI Reality Check
Question I asked: “When do AI investments pay back?”
Anthropic CTO: “If you’re selling AI products, you need 60%+ gross margins to survive. Infrastructure costs eat 30-40% of revenue in year 1-2. Optimize down to 20-25% by year 3-4.”
Midjourney CEO: “We didn’t hit positive unit economics until month 18. Had to raise $50M to get there. If you don’t have 2+ years of runway, don’t start a high-compute AI company.”
Stability AI VP Eng: “Our burn rate was $5M/month at peak, 80% on infrastructure. We had to dramatically cut experiments and focus only on products with clear revenue path.”
My Takeaway for Our Startup
We’re building AI-powered analytics. Current spend: $8K/month on inference.
Planning for next 12 months:
- Current trajectory: $8K → $25K/month as we scale users
- That’s $300K annual run rate
- Need to get gross margins above 70% to make economics work
- Optimization roadmap: Model distillation, caching, batching
The sobering truth: AI infrastructure costs scale with users faster than revenue. You MUST have a plan for unit economics from day one.
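Our own 70% margin target translates into a concrete revenue floor. This sketch treats inference as the only cost of goods sold, which is an assumption; real COGS would also include storage and support.

```python
def revenue_for_margin(inference_cost_monthly: float,
                       target_margin: float) -> float:
    """Monthly revenue needed so inference is (1 - margin) of revenue."""
    return inference_cost_monthly / (1 - target_margin)

needed_now = revenue_for_margin(8_000, 0.70)      # ~$26.7K/month today
needed_at_scale = revenue_for_margin(25_000, 0.70)  # ~$83.3K/month later
```

If inference grows to $25K/month, we need $83K+ of monthly revenue just to hold the margin line, which is exactly the "costs scale faster than revenue" trap.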
Anyone else dealing with runaway GPU costs?
Michelle
Reporting from SF Tech Week - Moscone Center, “AI Infrastructure at Scale” panel
Sources:
- Anthropic, Stability AI, Midjourney CTOs (live panel)
- Crusoe Energy pricing data
- AWS/GCP/Azure public pricing