đź’° The Real Cost of AI: Infrastructure Economics Beyond the Hype

Just came from the “AI Infrastructure at Scale” panel at SF Tech Week and the numbers are sobering. Let’s talk real costs. :bar_chart:

Panelists: CTOs from Anthropic, Stability AI, Midjourney, plus Crusoe Energy (GPU cloud)

The GPU Crisis is Real

Current H100 Market (October 2025):

  • H100 spot price: $2.89/hour (down from $4.50 in June)
  • H100 reserved (1-year): $1.95/hour
  • A100 spot price: $1.10/hour
  • Cloud markup: 40-60% vs bare metal

Why prices dropped: major cloud providers finally got supply after NVIDIA shipped 500K+ H100s in Q3

But here’s the catch: H200s launching in Q1 2026, and everyone will want to upgrade

Training Cost Reality Check

Anthropic shared actual numbers for Claude models:

  • Claude 2: $15M-$20M in compute (2023)
  • Claude 3: $35M-$45M in compute (2024)
  • Next generation: Estimated $80M-$120M

Stability AI’s Stable Diffusion 3:

  • Training cost: $6M
  • 16,384 A100s for 3 weeks
  • Plus data processing, storage, failed experiments

OpenAI GPT-4 (reported):

  • Estimated training cost: $100M-$150M
  • 25,000 A100s for 3-4 months

These are TRAINING costs only - not infrastructure, salaries, data, R&D.

Inference Cost Structure

Midjourney CEO shared their economics:

Revenue: $200M ARR
Inference costs: $50M-$60M annually (25-30% of revenue)
GPU fleet: 30,000+ A100s

Cost per image generation:

  • High quality: $0.032
  • Standard: $0.018
  • Price to user: $0.04 (paid tier)

Gross margin on compute: 20-25%

This is AFTER massive optimization. Year 1 they lost money on every image.

The Elephant in the Room: Who Can Actually Afford This?

Crusoe Energy’s brutal breakdown:

Minimum viable AI startup infrastructure:

  • 64 H100s (small cluster): $12K/month reserved, $18K spot
  • Storage for training data: $5K/month
  • Networking: $3K/month
  • Total: $20K-$26K/month = $240K-$312K/year

And that’s for a SMALL training run. Most serious models need 256+ GPUs.
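
To sanity-check figures like these, here's the arithmetic as a quick sketch - the helper function and the ~730 hours/month assumption are mine, not Crusoe's. (Worth noting: at the quoted $1.95/hour, 64 always-on H100s come out well above $12K/month, so that figure presumably assumes partial utilization or a smaller node.)

```python
# Rough monthly cost of a GPU cluster from hourly rates (assumes always-on usage).
def monthly_cluster_cost(gpu_count: int, hourly_rate: float,
                         storage_per_month: float = 0.0,
                         network_per_month: float = 0.0,
                         hours_per_month: float = 730) -> float:
    """Estimated monthly spend in dollars for an always-on cluster."""
    compute = gpu_count * hourly_rate * hours_per_month
    return compute + storage_per_month + network_per_month

# Example: 64 H100s at the quoted $1.95/hour reserved rate, plus the panel's
# storage and networking line items.
print(f"${monthly_cluster_cost(64, 1.95, 5_000, 3_000):,.0f}/month")
```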

Series A startup ($4M raise, 18-month runway):

  • AI infrastructure: $300K-$500K/year
  • That’s 12.5% of your runway on GPUs alone
  • Before salaries, before data, before anything else

Cloud vs On-Premise Decision

Panel consensus:

Under $500K annual GPU spend: Cloud (AWS, GCP, Azure)

  • Flexibility, no upfront capex
  • But 50% markup over bare metal

$500K-$2M annual: Specialized GPU clouds (Crusoe, Lambda, CoreWeave)

  • 30% cheaper than hyperscalers
  • Good support, fast provisioning

Over $2M annual: Consider on-premise or colo

  • Anthropic example: Own data centers in partnership with Equinix
  • Upfront cost: $3M-$5M for 1,024 H100 cluster
  • Break-even: 18-24 months vs cloud
  • Only makes sense if you’ll use it for 3+ years

The Hidden Costs Nobody Talks About

Storage:

  • Training datasets: 100TB-1PB
  • Model checkpoints: 10-50TB per training run
  • Logs and telemetry: 5-10TB/month
  • Cost: $0.023/GB/month (S3) = $2.3K-$23K/month for data

Networking:

  • Inter-GPU bandwidth crucial for distributed training
  • InfiniBand clusters: $200K-$500K infrastructure
  • Cross-region data transfer: $0.09/GB (adds up FAST)

Talent:

  • ML infrastructure engineer: $250K-$400K
  • You need 2-3 for 24/7 coverage
  • Plus ML engineers, data engineers

Failed Experiments:

  • Stability AI: “For every successful model, we have 10-15 failed training runs”
  • Failed runs still cost money - budget 2-3x your successful training cost

FinOps for AI: Cost Control Strategies

Strategies shared by panelists:

1. Spot instance arbitrage

  • Midjourney: 60% inference on spot instances
  • Automatic fallback to reserved
  • Saves 40% on compute (a minimal fallback sketch follows this strategy list)

2. Multi-cloud strategy

  • Play AWS vs GCP vs Azure pricing
  • “We moved 30% workload to GCP when they offered 25% discount” - Stability AI

3. Model optimization

  • Quantization: 8-bit weights use half the memory of FP16 (a quarter of FP32)
  • Distillation: Smaller models for inference
  • Midjourney cut inference costs 60% through optimization

4. Smart batching

  • Batch inference requests
  • Higher GPU utilization
  • 2-3x better cost efficiency

5. Geographic arbitrage

  • Oregon (cheap hydro power): $1.80/hour H100
  • Northern Virginia (demand): $2.50/hour H100
  • 40% cost difference for same hardware
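
To make strategy 1 concrete, here's a minimal sketch of the spot-with-fallback control flow. The provisioning hooks (request_spot, request_reserved) are hypothetical placeholders for whatever your cloud SDK exposes - the point is the retry-then-fall-back pattern, not a specific API.

```python
import time

# Hypothetical provisioning hooks - swap in your cloud SDK of choice.
def request_spot(gpu_count: int):
    """Try to acquire spot capacity; return a handle, or None if unavailable."""
    ...

def request_reserved(gpu_count: int):
    """Fall back to reserved/on-demand capacity (always available, costs more)."""
    ...

def acquire_gpus(gpu_count: int, max_spot_attempts: int = 3, backoff_s: float = 30.0):
    """Prefer spot capacity, retry briefly, then fall back to reserved."""
    for attempt in range(max_spot_attempts):
        handle = request_spot(gpu_count)
        if handle is not None:
            return handle, "spot"
        time.sleep(backoff_s * (attempt + 1))   # simple linear backoff
    return request_reserved(gpu_count), "reserved"
```

Midjourney's 60/40 spot/reserved split presumably comes from tuning how long they're willing to wait for spot capacity before eating the reserved price.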

ROI Reality Check

Question I asked: “When do AI investments pay back?”

Anthropic CTO: “If you’re selling AI products, you need 60%+ gross margins to survive. Infrastructure costs eat 30-40% of revenue in year 1-2. Optimize down to 20-25% by year 3-4.”

Midjourney CEO: “We didn’t hit positive unit economics until month 18. Had to raise $50M to get there. If you don’t have 2+ years of runway, don’t start a high-compute AI company.”

Stability AI VP Eng: “Our burn rate was $5M/month at peak, 80% on infrastructure. We had to dramatically cut experiments and focus only on products with clear revenue path.”

My Takeaway for Our Startup

We’re building AI-powered analytics. Current spend: $8K/month on inference.

Planning for next 12 months:

  • Current trajectory: $8K → $25K/month as we scale users
  • That’s $300K annual run rate
  • Need to get gross margins above 70% to make economics work
  • Optimization roadmap: Model distillation, caching, batching

The sobering truth: AI infrastructure costs scale with users faster than revenue. You MUST have a plan for unit economics from day one.

Anyone else dealing with runaway GPU costs?

Michelle :desktop_computer:

Reporting from SF Tech Week - Moscone Center, “AI Infrastructure at Scale” panel

Sources:

  • Anthropic, Stability AI, Midjourney CTOs (live panel)
  • Crusoe Energy pricing data
  • AWS/GCP/Azure public pricing

This is the financial reality check every AI founder needs. Following up from the a16z “AI Economics 101” session. :money_with_wings:

The Infrastructure Budget Model VCs Want to See

What a16z told founders:

Your financial model needs these line items:

  1. Training compute (capex or opex)
  2. Inference compute (scales with users)
  3. Storage (data + models)
  4. Networking
  5. ML infrastructure engineers
  6. “Failure budget” (20-30% of training costs)

Most founders only budget for items 1 and 2. Mistake.

Unit Economics Deep Dive

Example from the session (generative AI SaaS):

Revenue per user: $20/month
Inference cost per user: $4/month (20% of revenue)

Sounds okay until you factor in:

  • Customer acquisition cost: $60 (3-month payback)
  • Support: $2/user/month
  • Infrastructure overhead: $1/user/month
  • All-in gross margin: 13%

The scary math:

  • Need $100M revenue to hit $13M gross profit
  • Meanwhile burning $15M/year on fixed costs
  • Doesn’t work

Fix: Either raise prices to $35/month or cut inference costs to $1.50/month (4x optimization)
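
Here's that per-user math as a small sketch you can reuse - the function shape and the 12-month lifetime are my assumptions; the session's 13% "all-in" figure additionally folds acquisition and fixed-cost allocations into the margin, which is where most of the headline number disappears.

```python
def unit_economics(price, inference, support, overhead, cac, lifetime_months=12):
    """Per-user monthly gross profit (variable costs only), gross margin,
    CAC payback in months, and lifetime contribution after CAC."""
    gross_profit = price - inference - support - overhead
    gross_margin = gross_profit / price
    cac_payback_months = cac / gross_profit
    lifetime_contribution = gross_profit * lifetime_months - cac
    return gross_profit, gross_margin, cac_payback_months, lifetime_contribution

# Session example: $20/user, $4 inference, $2 support, $1 overhead, $60 CAC.
print(unit_economics(20, 4.00, 2, 1, 60))
# The two fixes discussed: raise price to $35, or cut inference to $1.50.
print(unit_economics(35, 4.00, 2, 1, 60))
print(unit_economics(20, 1.50, 2, 1, 60))
```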

Capital Efficiency: Cloud vs Own Infrastructure

I ran the numbers for our company:

Scenario A: Cloud (AWS)

  • Year 1: $400K
  • Year 2: $800K (scale 2x)
  • Year 3: $1.2M (scale 1.5x)
  • 3-year total: $2.4M
  • Capex: $0

Scenario B: Own GPUs (colo)

  • Year 0: $2.5M capex (512 H100s + infrastructure)
  • Year 1-3: $300K/year (power, colo fees, maintenance)
  • 3-year total: $3.4M
  • But: Own assets worth ~$1.5M residual

Break-even analysis:

  • Cloud cheaper if usage <$1M/year
  • Own breaks even at 2.5 years
  • Own wins if you’re in it for 3+ years AND can predict usage

The risk: What if your model changes? Stuck with $2.5M deprecating hardware.

a16z recommendation: Cloud until Series B, then evaluate owned infrastructure
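
Here's the Scenario A/B comparison as a tiny sketch, netting the residual value off the owned path - the helper and the netting choice are mine.

```python
def three_year_tco(cloud_costs_by_year, capex, opex_per_year, residual_value):
    """3-year total cost: pay-as-you-go cloud vs owned hardware net of resale value."""
    cloud = sum(cloud_costs_by_year)
    owned = capex + 3 * opex_per_year - residual_value
    return cloud, owned

# Scenario A/B numbers from above.
cloud, owned = three_year_tco([400_000, 800_000, 1_200_000],
                              capex=2_500_000, opex_per_year=300_000,
                              residual_value=1_500_000)
print(f"cloud: ${cloud:,.0f}   owned (net of residual): ${owned:,.0f}")
```

Net of residual value the owned path wins on paper, which is exactly why the "only if you can predict usage" caveat carries so much weight.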

Cash Flow Impact

This is huge and founders miss it:

Cloud: Pay-as-you-go, matches revenue curve

  • Burn $20K/month early
  • Scale to $100K/month as you grow

Owned: $2.5M upfront

  • Destroys runway
  • Need to raise extra $3M just for infrastructure

Debt financing option:

  • Some banks will finance GPU purchases (saw pitch from Silicon Valley Bank)
  • 4-year term, 8% interest
  • Turns capex into opex
  • But: Debt on balance sheet affects future fundraising

The Optimization Imperative

Numbers from Hugging Face CFO session:

Their optimization journey:

  • Month 1: $0.12 per inference
  • Month 6: $0.08 (33% reduction via batching)
  • Month 12: $0.05 (58% reduction via quantization)
  • Month 18: $0.03 (75% reduction via distillation + caching)

This is the difference between bankrupt and profitable.

Budgeting for Different Stages

Seed ($2M raise):

  • Max AI infrastructure: $150K/year (7.5% of runway)
  • Focus on proving concept, not scale
  • Cloud only, no owned infrastructure

Series A ($8M raise):

  • Max AI infrastructure: $500K/year
  • Hire ML infra engineer ($300K)
  • Still cloud, start optimization efforts

Series B ($25M raise):

  • Infrastructure budget: $2M-$3M/year
  • ML infra team of 3-4
  • Evaluate owned infrastructure
  • FinOps becomes critical

Gross Margin Benchmarks

VCs shared what they want to see:

Year 1-2 (early):

  • AI-first product: 40-50% gross margin acceptable
  • Still optimizing, proving model

Year 3-4 (scaling):

  • Need 60-70% gross margin
  • Infrastructure costs must be <30% of revenue
  • Approaching SaaS economics

Year 5+ (mature):

  • 70-80% gross margin expected
  • Infrastructure <20% of revenue
  • Comparable to traditional SaaS

If you can’t hit these, you have a structural problem.

My Action Items

  1. Add “infrastructure optimization” as board-level KPI
  2. Hire ML infra engineer to focus on cost reduction
  3. Set quarterly gross margin targets
  4. Build financial model with unit economics sensitivity analysis

The meta lesson: Treat AI infrastructure costs like COGS, not R&D. They scale with revenue and must be managed obsessively.

Carlos :chart_increasing:

SF Tech Week - a16z “AI Economics 101” session

Coming from the “Efficient ML Systems” workshop - let me add the engineering reality to these financial discussions. :wrench:

Training Costs: The Details They Don’t Tell You

Workshop leaders: ML infrastructure engineers from Meta, Google Brain, NVIDIA

Real example from Meta’s Llama 3 training:

Hardware: 16,000 H100 GPUs
Duration: 21 days
Power consumption: 10 MW average
Training cost breakdown:

  • GPU compute: $6.5M (16K GPUs Ă— 21 days at Meta's internal cost, well below cloud list rates)
  • Storage: $400K (training data + checkpoints)
  • Networking: $200K (InfiniBand infrastructure)
  • Power: $500K (10 MW Ă— 21 days Ă— $0.12/kWh)
  • Failed runs and experiments: $2M
  • Total: $9.6M

And that’s ONE model. Meta trains dozens of variants in parallel.

Why Training Costs Spiral

Reasons from NVIDIA’s session:

1. Hyperparameter search

  • Need to try 10-20 different configurations
  • Each one is a full training run
  • Only 1-2 will be production-worthy

2. Data quality issues

  • Found data bug at epoch 50? Start over.
  • Contaminated training data? Start over.
  • Meta: “We restarted Llama 3 training 4 times due to data issues”

3. Model instability

  • Training diverges at epoch 80? Start over.
  • “We lost 2 weeks of training due to NaN gradients” - Google engineer

4. Hardware failures

  • At 16K GPU scale, something fails daily
  • Checkpointing and restarts add 10-15% overhead
  • Need redundancy built in

Inference Optimization: How to Cut Costs 4x

Techniques the workshop covered (a quantization sketch follows at the end of this section):

1. Quantization (INT8/INT4)

  • Before: 16-bit float (FP16)
  • After: 8-bit integer (INT8)
  • Memory reduction: 2x
  • Speed improvement: 1.5-2x
  • Accuracy loss: <1% for most models
  • Cost savings: 60-70%

2. Model distillation

  • Train small model to mimic large model
  • Example: GPT-4 → GPT-3.5-sized model
  • Performance: 90-95% of original
  • Cost: 10x cheaper inference
  • Use case: 80% of queries don’t need full model

3. KV cache optimization

  • Cache key-value pairs for repeated tokens
  • Reduces compute for long contexts
  • Savings: 30-40% for chat applications

4. Speculative decoding

  • Use small model to predict next tokens
  • Verify with large model
  • 2-3x faster inference
  • Same accuracy

Meta’s results combining all techniques:

  • Llama 3 base: $0.024 per 1K tokens
  • Optimized: $0.006 per 1K tokens
  • 4x cost reduction with <2% quality loss
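
To make technique 1 concrete, here's a minimal post-training dynamic quantization sketch in PyTorch. It's my own toy example, not Meta's pipeline - production stacks usually use calibrated static INT8 or a specialized inference runtime to get the full 60-70% savings quoted above.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer MLP block (FP32 here for simplicity).
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).eval()

def param_megabytes(m: nn.Module) -> float:
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time (CPU execution).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(f"FP32 weights: {param_megabytes(model):.1f} MB "
      f"(INT8 weights take roughly a quarter of this, half of FP16)")
with torch.no_grad():
    out = quantized(torch.randn(1, 1024))
print(out.shape)  # same interface, smaller and faster weights
```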

Storage Costs: The Hidden Giant

Google Brain engineer shared their numbers:

Training dataset for large LLM:

  • Raw data: 15TB (web crawl, books, code)
  • Processed data: 8TB (cleaned, tokenized)
  • Multiple copies: 3-4x (redundancy, fast access)
  • Total storage: 35-40TB

But that’s just the start:

Checkpoints during training:

  • Save model every 1,000 steps
  • Each checkpoint: 500GB-2TB
  • 50 checkpoints per training run
  • Storage: 25TB-100TB

Experiment tracking:

  • Logs, metrics, intermediate outputs
  • 10-20TB per major training run

Total for one model: 70-160TB

Cost at scale:

  • Standard storage (S3): $0.023/GB/month = $1,600-$3,700/month
  • High-performance (EBS): $0.10/GB/month = $7,000-$16,000/month

And you keep this for months/years for reproducibility.

Networking: The Bottleneck

NVIDIA deep dive on GPU cluster networking:

Why InfiniBand matters:

  • Within a node, H100s exchange data over NVLink at roughly 900 GB/s
  • Across nodes, typical Ethernet fabrics run at 100 Gb/s per link - a huge bottleneck
  • InfiniBand NDR runs at 400 Gb/s per link (4x the bandwidth), plus RDMA and lower latency

Cost difference:

  • 1,024 GPU cluster with Ethernet: $8M GPUs + $500K networking
  • Same cluster with InfiniBand: $8M GPUs + $2.5M networking

But training speed:

  • Ethernet: 100 days
  • InfiniBand: 28 days

ROI calculation:

  • InfiniBand upfront cost: +$2M
  • Saves: 72 days Ă— 1,024 GPUs Ă— $2/hour = $3.5M
  • Net savings: $1.5M

You NEED InfiniBand for clusters >256 GPUs or you waste money on slow training.

Power and Cooling: Real Infrastructure Costs

From Crusoe Energy session:

H100 GPU: 700W power draw
1,024 GPU cluster: 717 kW just for GPUs
Plus:

  • CPUs, networking, storage: +30% = 930 kW
  • Cooling (1.3 PUE): +280 kW
  • Total: 1,210 kW = 1.2 MW

Monthly power cost:

  • 1.2 MW Ă— 730 hours Ă— $0.12/kWh = $105,000/month
  • Annual: $1.26M

This is why Crusoe builds data centers next to stranded natural gas - power at $0.03/kWh vs $0.12/kWh.
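
Same arithmetic as a small sketch, using the session's 700 W per GPU, 30% non-GPU overhead, and 1.3 PUE figures - the function shape and ~730 hours/month are my assumptions.

```python
def monthly_power_cost(gpu_count: int, gpu_watts: float = 700,
                       overhead_factor: float = 1.30,   # CPUs, networking, storage
                       pue: float = 1.3,                # cooling overhead
                       price_per_kwh: float = 0.12,
                       hours: float = 730) -> float:
    """Estimated monthly electricity bill for a GPU cluster, in dollars."""
    it_load_kw = gpu_count * gpu_watts / 1000 * overhead_factor
    total_kw = it_load_kw * pue
    return total_kw * hours * price_per_kwh

# Crusoe's example: 1,024 H100s on grid power vs stranded-gas power.
print(f"${monthly_power_cost(1024):,.0f}/month at $0.12/kWh")
print(f"${monthly_power_cost(1024, price_per_kwh=0.03):,.0f}/month at $0.03/kWh")
```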

The Build vs Buy Decision (Technical Perspective)

When to use cloud:

  • Experiments and prototyping
  • Variable workloads
  • <500 GPUs equivalent usage

When to buy:

  • 1,000+ GPUs of continuous usage
  • Predictable workloads (training pipelines)
  • 3+ year horizon

Hybrid approach (what we do):

  • Research and experiments: Cloud (GCP)
  • Production training: Reserved cloud instances
  • Inference: Mix of cloud + edge (optimized models)

My Optimization Roadmap

Quarter 1: Implement INT8 quantization

  • Target: 50% inference cost reduction
  • Engineering time: 6 weeks

Quarter 2: Model distillation

  • Train smaller specialist models for common queries
  • Target: 3x cost reduction on 70% of traffic

Quarter 3: Multi-cloud strategy

  • Spot instance arbitrage across AWS/GCP/Azure
  • Target: 30% cost reduction via pricing competition

Quarter 4: Caching and batching optimization

  • Intelligent request batching
  • Target: 40% better GPU utilization

Combined target: 4-5x cost reduction over 12 months

This is the only way to make AI products economically viable.

@cto_michelle - would love to compare notes on quantization results once you implement!

Rachel :abacus:

SF Tech Week - “Efficient ML Systems” workshop, Moscone Center

Attended “Scaling AI Engineering Teams” session and want to add the people/process costs to this discussion. :briefcase:

Background: I manage 25 engineers, 8 focused on ML/AI infrastructure.

The Hidden Cost: ML Infrastructure Engineers

Salary benchmarks from the session (SF Bay Area 2025):

Junior ML Infra Engineer (1-3 years): $180K-$240K
Mid-level (3-5 years): $240K-$320K
Senior (5-8 years): $320K-$450K
Staff+ (8+ years): $450K-$600K

Total comp including equity, benefits: Add 30-40%

Why so expensive?

Required skills (rare combo):

  • Distributed systems
  • ML frameworks (PyTorch, JAX)
  • GPU programming (CUDA)
  • Cloud infrastructure (K8s, Terraform)
  • Performance optimization

Maybe 5,000 people globally have all these skills. High demand, limited supply.

Team Size by Company Stage

Data from Anthropic, Stability AI, Hugging Face:

Seed stage AI startup:

  • ML engineers: 2-4
  • ML infrastructure: 0 (rely on cloud services)
  • Total eng headcount: 5-8

Series A ($50M-$100M valuation):

  • ML engineers: 8-12
  • ML infrastructure: 1-2
  • Data engineers: 2-3
  • Total AI-focused: 11-17

Series B ($200M-$500M valuation):

  • ML engineers: 20-30
  • ML infrastructure: 4-6
  • Data engineers: 5-8
  • ML Ops: 2-3
  • Total AI-focused: 31-47

At scale (unicorn+):

  • Anthropic: ~150 ML engineers + 40 ML infrastructure
  • OpenAI: ~200 ML engineers + 60 ML infrastructure
  • Ratio stabilizes around 3-4 ML engineers per 1 infrastructure engineer

The Infrastructure Team ROI

Question I asked: “How do you justify ML infra headcount to finance?”

Anthropic VP Eng answer:

One good ML infra engineer can:

  • Reduce training costs 30-50% through optimization
  • Speed up training 2-3x
  • Improve researcher productivity 2x (faster iteration)

Math:

  • Salary cost: $400K/year fully loaded
  • Infrastructure savings: $1M+/year
  • Researcher productivity gain: 5 ML engineers Ă— 20% faster = 1 FTE equivalent = $350K value
  • ROI: 3-4x

But only if you’re spending >$2M/year on infrastructure. Below that, use managed services.

Team Structure Patterns

Pattern 1: Centralized ML Platform Team (we use this)

Structure:

  • ML Platform team (4-6 engineers)
  • Builds internal tools, manages infrastructure
  • Serves product ML teams

Pros:

  • Avoid duplicated work
  • Consistent tooling
  • Better cost optimization

Cons:

  • Can become bottleneck
  • Not as close to product needs

Pattern 2: Embedded Infrastructure Engineers

Each product team has 1 ML infra engineer

Pros:

  • Fast iteration
  • Product-specific optimization

Cons:

  • Duplicated effort
  • Inconsistent practices

Pattern 3: Hybrid (what Anthropic uses)

  • Central platform team (10 engineers)
  • Plus embedded infra in each major product area (2-3 per area)

Works well at 100+ engineers, overkill below that.

Hiring and Retention Challenges

Reality check from the panel:

Time to hire ML infra engineer:

  • Post job → First interview: 4-6 weeks
  • Interview process: 3-4 weeks
  • Offer → Start: 4-6 weeks
  • Total: 3-4 months

Why so slow?

  • Small candidate pool
  • Multiple companies competing for same people
  • Candidates are picky (can afford to be)

Retention:

  • Average tenure: 2-3 years
  • Competitors constantly recruiting
  • Need to promote or give raises every 12-18 months

Our strategy:

  • Hire junior engineers with distributed systems background
  • Train on ML infrastructure (6-9 month ramp)
  • Cheaper, better retention
  • But: Need senior engineers to train them

Training and Onboarding Costs

What we spend getting new ML infra engineer productive:

Month 1-2: Reading code, small tasks (20% productive)
Month 3-4: Meaningful contributions (40% productive)
Month 5-6: Independent work (70% productive)
Month 7-9: Fully productive (100%)

Hidden costs:

  • Senior engineer mentoring: 20% time Ă— 6 months = 0.1 FTE = $40K
  • Learning budget: $5K-$10K (courses, books, conferences)
  • Mistakes during ramp: Hard to quantify, but real

Full cost to productivity: $60K-$80K on top of salary

Tooling and Process Costs

Internal tools we’ve built (8 engineer-months total):

  1. Training job scheduler ($80K dev cost)

    • Manages GPU allocation
    • Spot instance fallback
    • Saves $200K/year in efficiency
  2. Experiment tracking (built on MLflow, $60K customization)

    • Reproducibility
    • Cost attribution
    • Compliance audit trail
  3. Model deployment pipeline ($120K)

    • Automated testing
    • Gradual rollout
    • Rollback capability

Total: $260K one-time + $80K/year maintenance

Build vs buy decision:

  • We could use SageMaker ($50K/year)
  • But wanted custom integration
  • Break-even: 3 years

In retrospect, should’ve bought for first 2 years, built later.

My Advice for Team Scaling

If you’re spending <$500K/year on AI infrastructure:

  • Don’t hire dedicated ML infra engineers yet
  • Use managed services (SageMaker, Vertex AI)
  • ML engineers handle their own infrastructure

$500K-$2M/year:

  • Hire first ML infra engineer (senior)
  • Focus on cost optimization and tooling
  • ROI is clear at this scale

$2M+/year:

  • Build ML platform team (3-5 engineers)
  • Centralized infrastructure management
  • Internal developer platforms

The mistake I see: Hiring ML infra too early (pre-Series A) or too late (Series B with no infra team)

Timing matters.

@cto_michelle @data_rachel - curious how your team structures compare?

Luis :busts_in_silhouette:

SF Tech Week - “Scaling AI Engineering Teams” panel

Just left the CoreWeave “GPU Infrastructure at Scale” session - the numbers they shared are eye-opening. :desktop_computer:

Session: CoreWeave + NVIDIA “The Economics of GPU Clouds” at Moscone West

Speakers:

  • CoreWeave VP of Infrastructure
  • NVIDIA Enterprise Computing lead
  • Lambda Labs CEO

The H100 Market Reality (October 2025)

CoreWeave shared their current pricing:

H100 80GB SXM5 (flagship):

  • On-demand: $2.89/hour
  • 1-year reserved: $1.95/hour
  • 3-year reserved: $1.49/hour

Compare to hyperscalers:

  • AWS p5.48xlarge (8x H100): $98.32/hour = $12.29/GPU/hour (4.2x more expensive!)
  • GCP a3-highgpu-8g (8x H100): $12.48/GPU/hour
  • Azure ND H100 v5: $13.76/GPU/hour

Why the markup? According to AWS/GCP speakers at other sessions:

  • Enterprise support
  • Integration with cloud services
  • SLAs and compliance
  • Networking infrastructure

But for pure GPU compute, specialized clouds are 4-5x cheaper.

Source: GPU Cloud Pricing | CoreWeave (verified live during session)

The Supply Situation

NVIDIA speaker dropped some data:

H100 shipments (2024-2025):

  • Q4 2024: 120,000 units shipped
  • Q1 2025: 150,000 units
  • Q2 2025: 180,000 units
  • Q3 2025: 200,000 units

Total H100s in market: ~650,000 units globally

Demand vs Supply:

  • Estimated demand: 1.2M units
  • Supply: 650K units
  • Shortfall: 550K units (supply covers only ~54% of estimated demand)

Why prices dropped from $4.50 to $2.89/hour:

  • Major deployments completed (OpenAI, Meta, Anthropic bought huge clusters)
  • Supply catching up
  • H200 announcement (people waiting for next gen)

Quote from NVIDIA: “We expect pricing to stabilize around $2.50-3.00/hour for spot, $1.80-2.00 for reserved through Q1 2026.”

The H200 Timeline

NVIDIA roadmap revealed:

H200 (upgraded H100 with HBM3e):

  • Availability: Q4 2025 (limited)
  • Volume availability: Q1 2026
  • Performance: 1.4x memory bandwidth vs H100
  • Pricing estimate: $3.50-4.00/hour on-demand

Lambda Labs CEO: “Everyone’s waiting for H200. We expect H100 prices to drop another 15-20% when H200 ships in volume.”

Real Customer Economics

Case study shared by CoreWeave:

Customer: Mid-size AI startup (Series B)
Use case: Training 70B parameter model

Cloud comparison:

Option A: AWS

  • 256x H100 (32x p5.48xlarge instances)
  • Duration: 14 days training
  • Cost: $12.29/GPU/hour Ă— 256 GPUs Ă— 336 hours = $1,056,154

Option B: CoreWeave

  • 256x H100 cluster
  • Reserved pricing: $1.95/hour
  • Cost: $1.95 Ă— 256 Ă— 336 = $167,731

Savings: $888,423 (84% cheaper!)

But there’s a catch: AWS has better integration with other services (S3, CloudWatch, etc.)

Customer went with CoreWeave for training, AWS for inference.

The Inference Economics

Lambda Labs shared inference cost data:

Serving a 70B model:

Hardware requirements:

  • 2x H100 80GB (model barely fits)
  • Or 4x A100 40GB
  • Or 8x A100 80GB

Cost per 1M tokens (output):

2x H100 setup:

  • GPU cost: $2.89 Ă— 2 = $5.78/hour
  • Throughput: ~15K tokens/sec
  • Cost per 1M tokens: $0.107

4x A100 40GB setup:

  • GPU cost: $1.10 Ă— 4 = $4.40/hour
  • Throughput: ~8K tokens/sec (slower)
  • Cost per 1M tokens: $0.153

In this example the H100 setup actually wins on cost per token; A100s only become more cost-effective for inference if you can push their throughput up (bigger batches, higher utilization).
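
The formula behind those per-token numbers, as a tiny sketch (helper name is mine): cost per 1M tokens = hourly GPU cost divided by tokens generated per hour, times a million.

```python
def cost_per_million_tokens(gpus: int, hourly_rate: float, tokens_per_sec: float) -> float:
    """Serving cost per 1M output tokens, from GPU count, $/GPU/hour, and throughput."""
    hourly_cost = gpus * hourly_rate
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

# Lambda's two 70B-serving setups from above:
print(cost_per_million_tokens(2, 2.89, 15_000))   # ~0.107 - 2x H100
print(cost_per_million_tokens(4, 1.10, 8_000))    # ~0.153 - 4x A100 40GB
```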

The Networking Cost Nobody Talks About

CoreWeave infrastructure deep dive:

InfiniBand networking for GPU clusters:

256 GPU cluster networking:

  • 32x 8-GPU nodes
  • InfiniBand switches: $850K
  • Cables and adapters: $180K
  • Total networking: $1.03M

GPU cost: 256 Ă— $30K = $7.68M
Networking adds 13% to hardware cost

Why necessary? H100-to-H100 communication requires 400 Gbps+ bandwidth for efficient training.

Ethernet alternative:

  • 400GbE switches: $320K (69% cheaper)
  • But training is 2.8x slower
  • Economics: Slower training costs more in GPU time than networking savings

Conclusion: InfiniBand is mandatory for serious training.

Geographic Arbitrage Opportunities

CoreWeave has data centers in 8 locations with different pricing:

Cheapest regions (hydroelectric power):

  • Las Vegas, NV: $1.85/hour H100 reserved
  • Chicago, IL: $1.88/hour
  • Minneapolis, MN: $1.90/hour

Most expensive:

  • Northern Virginia: $2.15/hour (high demand)
  • Silicon Valley: $2.25/hour (power costs)

Savings: 18% cheaper in Vegas vs Silicon Valley

But: Data egress costs matter

  • Training data in: Free
  • Model checkpoints out: $0.08/GB
  • 2TB checkpoint = $160 to transfer

Optimize: Train in cheap region, keep data there, only transfer final model.

The Professional Services Reality

Panel discussion: “Why AI Infrastructure Projects Fail”

Average infrastructure project costs:

Hardware/cloud: $500K
But also need:

  • ML infrastructure engineer: $180K salary Ă— 6 months = $90K
  • Integration work: $120K
  • Debugging and optimization: $80K
  • Total real cost: $790K

58% over the hardware cost alone.

CoreWeave VP: “Customers budget for GPUs, not for the engineering time. That’s why 40% of projects run out of budget before completion.”

My Takeaways

  1. Specialized GPU clouds are 4-5x cheaper than AWS/GCP/Azure - use them for training
  2. H100 prices will drop another 15-20% in Q1 2026 when H200 ships - time your large purchases
  3. InfiniBand networking is mandatory - budget 13% extra for networking
  4. Geographic arbitrage saves 18% - choose regions with cheap power
  5. Professional services cost 58% extra - budget accordingly

For our startup:

  • Move training workloads from AWS to CoreWeave: Save $400K/year
  • Keep inference on AWS: Better integration with our stack
  • Wait for H200 for next major model training: Save 15-20%

This session alone will save us hundreds of thousands of dollars.

David :light_bulb:

Reporting from SF Tech Week - CoreWeave “GPU Infrastructure at Scale” session

Reporting from Databricks “Production ML Infrastructure” workshop - they shared real customer cost data that’s incredibly valuable. :bar_chart:

Session: Databricks + Snowflake “The Economics of Production ML” at Moscone Center

Speakers:

  • Databricks VP of ML Platform
  • Snowflake Head of AI/ML
  • Cost optimization engineers from both companies

The Real Cost Structure of ML in Production

Databricks analyzed costs across 500 enterprise ML deployments:

Average breakdown:

  • Training compute: 35% of total ML infrastructure cost
  • Inference compute: 42%
  • Data storage: 12%
  • Data movement/networking: 8%
  • Monitoring/logging: 3%

Key insight: Most companies optimize training costs (35%), ignore inference costs (42%) - optimizing the wrong thing!

Training Cost Optimization Strategies

Strategy 1: Spot instances

Case study: E-commerce company training recommendation models

Before (all on-demand):

  • 128 A100s Ă— 7 days
  • On-demand: $1.85/hour
  • Cost: $1.85 Ă— 128 Ă— 168 = $39,782

After (90% spot, 10% on-demand for fault tolerance):

  • Spot price: $0.74/hour (60% discount)
  • 115 spot GPUs: $0.74 Ă— 115 Ă— 168 = $14,304
  • 13 on-demand: $1.85 Ă— 13 Ă— 168 = $4,037
  • Total: $18,341

Savings: $21,441 (54%)

Tradeoff:

  • 15% more training time (interruptions)
  • Need sophisticated checkpointing
  • Databricks handles this automatically

Databricks recommendation: 80-90% spot for fault-tolerant training jobs.

Strategy 2: Right-sizing GPU selection

Common mistake: Using H100s for everything

Example from Snowflake customer:

Training sentiment analysis model:

  • Initially: 8x H100 (overkill)
  • Cost: $2.89 Ă— 8 Ă— 24 hours = $554/day

Analysis showed:

  • Model size: 7B parameters
  • Fits in 4x A100 40GB
  • Training time: Only 15% longer

Optimized:

  • 4x A100 40GB
  • Cost: $1.10 Ă— 4 Ă— 28 hours = $123/day

Savings: 78%

Quote from Databricks: “60% of training jobs use more expensive GPUs than necessary. Right-sizing saves 40-60% on average.”
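
A minimal right-sizing sketch in the spirit of that example. The memory estimate and the speed ratio are rough assumptions back-solved from the numbers above, not benchmarks - the point is the "cheapest config that fits" loop.

```python
# (name, gpu_count, memory_gb_each, $/GPU/hour, relative_speed vs 8x H100).
# Rates are the figures quoted earlier; relative_speed 0.86 is back-solved
# from "only 15% longer" in the example above - an assumption, not a benchmark.
CONFIGS = [
    ("8x H100 80GB", 8, 80, 2.89, 1.00),
    ("4x A100 40GB", 4, 40, 1.10, 0.86),
]

def cheapest_config(required_memory_gb: float, baseline_hours: float):
    """Pick the lowest total-cost config among those with enough GPU memory."""
    options = []
    for name, n, mem, rate, speed in CONFIGS:
        if n * mem < required_memory_gb:
            continue                       # weights + optimizer state won't fit
        hours = baseline_hours / speed     # slower hardware runs longer
        options.append((n * rate * hours, name, round(hours, 1)))
    return min(options)

# 7B-parameter sentiment model: call it ~120GB of training state (rough guess),
# 24 hours on the 8x H100 baseline.
print(cheapest_config(required_memory_gb=120, baseline_hours=24))
```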

Strategy 3: Batch training jobs

Case study: Fintech with 20 models to train monthly

Before (serial training):

  • Train each model separately
  • GPU utilization: 60%
  • Cluster idle 40% of time
  • Cost: $50K/month

After (batched training):

  • Train multiple models in parallel
  • GPU utilization: 92%
  • Same cluster, more throughput
  • Cost: $32K/month

Savings: 36%

Implementation: Databricks job scheduler automatically batches compatible jobs.

Inference Cost Optimization (The Bigger Opportunity)

Databricks data: Inference is 42% of costs but only 10% of optimization effort.

Strategy 1: Model quantization

Real example: Healthcare AI (diagnostic predictions)

Original model:

  • FP16 precision
  • 13B parameters
  • Inference: 4x A100 (80GB total needed)
  • Cost: $1.10 Ă— 4 = $4.40/hour
  • Throughput: 50 predictions/sec
  • Cost per 1M predictions: $24.44

Quantized to INT8:

  • Memory: 50% reduction (40GB needed)
  • Inference: 2x A100 40GB
  • Cost: $1.10 Ă— 2 = $2.20/hour
  • Throughput: 75 predictions/sec (faster!)
  • Cost per 1M predictions: $8.15

Savings: 67%

Accuracy loss: <1% (acceptable for this use case)

Snowflake stat: “Quantization saves 50-70% on inference costs with <2% accuracy degradation for most models.”

Strategy 2: Serverless inference

Databricks introduced serverless ML inference (beta):

Traditional approach:

  • Provision 4x GPUs for peak load
  • Average utilization: 30%
  • Paying for idle capacity 70% of time

Serverless:

  • Pay per prediction
  • Auto-scales from 0 to 1000s of GPUs
  • No idle cost

Pricing:

  • $0.0003 per prediction (13B model)
  • Volume discounts at 10M+ predictions

Break-even analysis:

Traditional (4x A100):

  • Cost: $4.40/hour = $3,168/month
  • Covers: ~10.5M predictions

Serverless:

  • Cost: $0.0003 Ă— predictions
  • 10.5M predictions = $3,150/month

Below 10M predictions/month: Serverless cheaper
Above 10M predictions/month: Dedicated GPUs cheaper
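
The break-even arithmetic as a sketch - rates are from the session, helper names are mine.

```python
SERVERLESS_RATE = 0.0003      # $/prediction for the 13B model (quoted above)
DEDICATED_HOURLY = 4.40       # 4x A100 at $1.10/hour
HOURS_PER_MONTH = 720

def monthly_costs(predictions: int):
    """Return (serverless, dedicated) monthly cost for a given volume."""
    return predictions * SERVERLESS_RATE, DEDICATED_HOURLY * HOURS_PER_MONTH

def break_even_predictions():
    """Volume above which dedicated GPUs beat serverless."""
    return DEDICATED_HOURLY * HOURS_PER_MONTH / SERVERLESS_RATE

print(f"{break_even_predictions():,.0f} predictions/month")  # ~10.6M
print(monthly_costs(5_000_000))    # serverless wins well below break-even
print(monthly_costs(20_000_000))   # dedicated wins well above it
```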

Strategy 3: Caching and deduplication

Case study from session: Customer support chatbot

Analysis showed:

  • 40% of queries are similar/duplicate
  • Cache responses for common questions
  • Only 60% hit the model

Before caching:

  • 10M queries/month
  • All hit model
  • Cost: $0.0003 Ă— 10M = $3,000

After caching:

  • 6M unique queries hit model
  • 4M served from cache ($0.00001/query)
  • Cost: $0.0003 Ă— 6M + $0.00001 Ă— 4M = $1,800 + $40 = $1,840

Savings: 39%

Implementation: Redis cache with embedding similarity search (costs $200/month, pays for itself 5x)
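
Here's a minimal semantic-cache sketch showing the idea. embed() and call_model() are hypothetical stand-ins for your embedding and inference endpoints, and the in-memory list stands in for Redis (or whatever vector store you actually use); the threshold is a made-up starting point you'd tune against false-hit rates.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92   # made-up starting point - tune against false hits
_cache = []                   # list of (unit-norm embedding, cached response)

def _unit(vec):
    v = np.asarray(vec, dtype=np.float32)
    return v / (np.linalg.norm(v) + 1e-9)

def answer(query: str, embed, call_model) -> str:
    """Serve from the semantic cache when a close-enough query was seen before."""
    q = _unit(embed(query))
    for emb, response in _cache:
        if float(np.dot(q, emb)) >= SIMILARITY_THRESHOLD:    # cosine similarity
            return response                                   # cache hit: ~free
    response = call_model(query)                              # cache miss: pay inference
    _cache.append((q, response))
    return response
```

The real cost/quality knob is the threshold: too low and you serve wrong cached answers, too high and your hit rate (and the 39% savings) evaporates.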

Data Storage Optimization

Problem: Training datasets and checkpoints are HUGE and companies keep everything forever.

Databricks customer audit:

Average ML team storage:

  • Raw training data: 150TB (kept forever)
  • Processed datasets: 80TB (many duplicates)
  • Model checkpoints: 120TB (most never used again)
  • Experiment artifacts: 50TB (debugging data)
  • Total: 400TB

Cost at $0.023/GB/month (S3): $9,200/month = $110K/year

Optimization strategy:

  1. Tiered storage:

    • Hot data (active experiments): S3 Standard
    • Warm data (recent models): S3 Glacier Instant
    • Cold data (compliance/archive): S3 Glacier Deep Archive
  2. Retention policies:

    • Raw data: 2 years then archive
    • Checkpoints: Keep final + last 3, delete rest
    • Failed experiments: Delete after 90 days

After optimization:

  • Hot data (S3 Standard): 40TB Ă— $0.023 = $920/month
  • Warm (Instant Retrieval): 60TB Ă— $0.004 = $240/month
  • Cold (Deep Archive): 100TB Ă— $0.00099 = $99/month
  • Deleted: 200TB = $0

New cost: $1,259/month (was $9,200)

Savings: 86% = $95K/year
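
One way to implement that retention policy is with S3 lifecycle rules so the tiering happens automatically. A minimal boto3 sketch, assuming placeholder bucket and prefix names and the day thresholds described above:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-artifacts",   # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {   # checkpoints: hot briefly, then Glacier IR, then Deep Archive
                "ID": "checkpoint-tiering",
                "Filter": {"Prefix": "checkpoints/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER_IR"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            },
            {   # failed experiments: delete after 90 days
                "ID": "failed-experiment-expiry",
                "Filter": {"Prefix": "experiments/failed/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
        ]
    },
)
```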

Data Movement Costs (The Hidden Killer)

Snowflake engineer: “Data egress bankrupts companies and they don’t realize it until the bill comes.”

Real incident:

ML team training in us-west-2:

  • Training data in us-east-1
  • Transferred 50TB per training run
  • 20 training runs/month
  • Data transfer: 1,000TB/month

AWS data transfer out pricing:

  • First 100TB: $0.09/GB = $9,000
  • Next 900TB: $0.085/GB = $76,500
  • Total: $85,500/month

Just for moving data between regions!

Fix:

  • Replicate training data to us-west-2 once: $4,500
  • All subsequent training: No transfer cost
  • Savings: $81K/month

Databricks recommendation: “Colocate compute and data. Data transfer should be <1% of your bill, not 40%.”

The FinOps Metrics That Matter

Session introduced ML FinOps metrics (a small sketch computing a few of them follows the list):

1. Cost per training run

  • Track: Total cost / training job
  • Benchmark: Trend over time (should decrease with optimization)

2. Cost per inference

  • Track: Total inference cost / number of predictions
  • Target: <$0.0005 for most models

3. GPU utilization

  • Track: Actual compute time / provisioned time
  • Target: >80% for reserved, >95% for on-demand

4. Training efficiency

  • Track: Model accuracy / total training cost
  • Optimize: Best accuracy per dollar

5. Inference efficiency

  • Track: Predictions per dollar
  • Optimize: Maximize throughput per GPU
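
A tiny sketch of the three most mechanical metrics from that list, with made-up illustrative inputs (not session data):

```python
def finops_metrics(total_training_cost, training_runs,
                   total_inference_cost, predictions,
                   gpu_hours_used, gpu_hours_provisioned):
    """Compute cost per training run, cost per prediction, and GPU utilization."""
    return {
        "cost_per_training_run": total_training_cost / training_runs,
        "cost_per_prediction": total_inference_cost / predictions,
        "gpu_utilization": gpu_hours_used / gpu_hours_provisioned,
    }

# Illustrative inputs only - not numbers from the session.
print(finops_metrics(120_000, 8, 30_000, 90_000_000, 41_000, 50_000))
```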

Databricks Unified Analytics Platform Pitch

They claimed:

“Customers moving from DIY ML infrastructure to Databricks save 40-60% on total costs.”

Breakdown:

  • No need for ML infrastructure engineers: Save $300K-600K/year in salaries
  • Automated optimization (spot instances, autoscaling): Save 40% on compute
  • Unified platform (less data movement): Save 60% on networking
  • Serverless inference: Save 30-50% on inference

Tradeoff: Lock-in to Databricks platform

My take: Worth it for most companies. Building this in-house is expensive.

My Action Items

  1. Audit our current spend by category (training vs inference vs storage)
  2. Implement model quantization for our 3 largest inference workloads (estimate 60% savings)
  3. Move to tiered storage for training data and checkpoints (estimate 80% savings on storage)
  4. Colocate data and compute (we’re paying $12K/month in transfer, should be $0)
  5. Evaluate Databricks serverless for low-volume models (<10M predictions/month)

Estimated total savings: $200K-300K/year

This session was incredibly valuable. Highly recommend for anyone managing ML infrastructure budgets.

Keisha :chart_decreasing:

Reporting from SF Tech Week - Databricks “Production ML Infrastructure” workshop
