Building on the edge architecture discussion, I want to focus specifically on AI workloads at the edge - which is driving much of the market growth Rachel documented.
Computer Vision: The Killer App for Edge AI
Computer vision remains the dominant edge AI use case in 2026, and for good reason:
Why Computer Vision Fits Edge Perfectly:
- High bandwidth requirements (4K video = ~100Mbps)
- Low latency needs (quality control decisions in real-time)
- Privacy benefits (video doesn’t leave premises)
- Cost savings (uploading video to cloud is expensive)
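The cost point is easy to verify with back-of-envelope math. A minimal sketch, taking the ~100 Mbps figure above at face value; the per-GB transfer price is an invented placeholder, not a quote from any provider:

```python
# Back-of-envelope: monthly upload volume for one continuously streaming camera.
# Assumes the ~100 Mbps bitrate cited above; $/GB price is illustrative only.

def monthly_upload_gb(bitrate_mbps: float, hours_per_day: float = 24.0) -> float:
    """Data volume in GB for one month (30 days) of streaming."""
    seconds = hours_per_day * 3600 * 30
    bits = bitrate_mbps * 1e6 * seconds
    return bits / 8 / 1e9  # bits -> bytes -> GB

gb = monthly_upload_gb(100)      # one 4K camera, 24/7
cost = gb * 0.05                 # hypothetical $0.05/GB transfer price
print(f"{gb:,.0f} GB/month, ~${cost:,.0f}/month per camera")
```

Tens of terabytes per camera per month, before you even pay for cloud inference - which is why the video stays on-prem.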
Real-World Applications:
- Manufacturing Quality Control - Inspect products at production speed, reject defects instantly
- Retail Analytics - Customer flow optimization, inventory monitoring, theft detection
- Healthcare - Medical imaging analysis in operating rooms, diagnostics at point of care
- Smart Cities - Traffic management, parking optimization, public safety monitoring
The LLM Challenge: Square Peg, Round Hole?
Now for the hard truth: deploying LLMs at the edge is brutal. At my startup, we’ve learned this the expensive way.
The Size Problem
Cloud LLM:
- GPT-4 scale: reportedly ~1.8 trillion parameters (never officially confirmed)
- Requires: 8x A100 GPUs (80GB each)
- Inference: 100-500ms, but with massive compute
Edge Constraints:
- Available GPU memory: 4-16GB typically
- Power budget: 15-75W (vs 300-400W datacenter GPUs)
- Cost: $500-2000 per device
You literally cannot fit full-scale LLMs on edge hardware. Period.
Our Solutions: Model Compression
We’ve had to get creative:
1. Quantization - Reduce model precision
- FP32 → INT8: 4x size reduction, ~3% accuracy loss
- FP32 → INT4: 8x size reduction, ~8-12% accuracy loss
We’re running 7B parameter models quantized to INT4, fitting in 3.5GB.
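The memory math behind those numbers is just parameters times bytes per parameter. A quick sketch (weights only - KV cache and activations need additional memory at runtime):

```python
# Rough memory needed just to hold model weights at different precisions.
# Ignores KV cache and activation memory, which add more at runtime.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billions: float, precision: str) -> float:
    """Weight storage in GB for a model of the given size and precision."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp32", "fp16", "int8", "int4"):
    print(f"7B @ {p}: {weight_gb(7, p):.1f} GB")
```

A 7B model at INT4 is 3.5 GB of weights, which is how it squeezes into a 4-16GB edge GPU; the same model at FP32 would need 28 GB.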
2. Distillation - Train smaller models to mimic larger ones
- Teacher model (70B params) → Student model (7B params)
- Retain 85-90% of capability at 10x smaller size
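The core of distillation is the loss: the student is trained to match the teacher's temperature-softened output distribution, not just the hard labels. A minimal NumPy sketch of that objective (Hinton-style knowledge distillation); the logits are toy numbers, and a real pipeline would use framework tensors and combine this with a standard cross-entropy term:

```python
# Minimal sketch of the knowledge-distillation objective: KL divergence
# between teacher and student distributions softened by temperature T.
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

teacher = [4.0, 1.0, 0.2]
aligned = [3.8, 1.1, 0.1]   # student close to teacher -> small loss
off     = [0.1, 3.9, 0.5]   # student disagrees -> large loss
print(kd_loss(aligned, teacher), kd_loss(off, teacher))
```

The soft targets are what transfer the teacher's "dark knowledge" (relative probabilities of wrong answers), which is why the student retains far more capability than its parameter count suggests.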
3. Specialized SLMs (Small Language Models)
- Domain-specific models: finance, healthcare, legal
- 1-7B parameters, optimized for specific tasks
- Better than general LLMs for narrow use cases
Autonomous Vehicles: The Ultimate Edge AI Test
The automotive industry’s shift from SAE Level 2+ to Level 3 is huge:
Level 2: Driver remains responsible and must supervise continuously, even when hands are briefly off the wheel
Level 3: Vehicle is responsible within its operational design domain; the driver must retake control when prompted
This isn’t just technical - it’s a liability shift from driver to OEM. If the car crashes while autonomous, the manufacturer is liable.
This demands:
- 99.9999% reliability (six nines)
- Zero cloud dependency for safety decisions
- Real-time sensor fusion (cameras, lidar, radar)
- Redundant processing (multiple edge compute units)
Tesla, Waymo, and others are deploying massive edge AI systems - custom silicon (Tesla's FSD chip), distributed processing, offline operation.
This is edge AI at its most demanding, and it’s driving hardware innovation.
Trade-offs I’ve Made
Here’s what we’ve learned deploying LLMs at edge:
What Works:
- Retrieval-augmented generation (RAG) with local vector DB
- Streaming responses (feels faster even if it isn’t)
- Hybrid: simple queries at edge, complex queries to cloud
- Specialized models per use case
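The RAG-at-edge pattern boils down to one retrieval step over a local store. A toy sketch with an in-memory "vector DB" and cosine similarity - the embeddings here are hand-picked 3-vectors, and the document names are invented; a real deployment would use an embedding model plus a local store (FAISS and sqlite-vec are common examples, not necessarily what we run):

```python
# Minimal sketch of the RAG retrieval step: cosine similarity over a tiny
# in-memory "vector DB". Embeddings and doc names are toy/hypothetical.
import numpy as np

docs = {
    "reset_guide": np.array([0.9, 0.1, 0.0]),
    "billing_faq": np.array([0.1, 0.8, 0.2]),
    "shipping_policy": np.array([0.0, 0.2, 0.9]),
}

def top_doc(query_vec: np.ndarray) -> str:
    """Return the key of the document most similar to the query vector."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(docs, key=lambda k: cosine(query_vec, docs[k]))

print(top_doc(np.array([0.85, 0.2, 0.1])))  # nearest to "reset_guide"
```

The retrieved passage gets stuffed into the local model's prompt, which lets a small model answer domain questions it could never memorize.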
What Doesn’t Work:
- Trying to match cloud LLM quality
- Zero-shot learning (edge models need fine-tuning)
- Frequent model updates (deployment overhead too high)
- General-purpose LLMs (too big, too slow)
Energy Efficiency: The Hidden Constraint
Edge devices are power-constrained in ways cloud datacenters aren’t:
Cloud: “Need more power? Add more servers.”
Edge: “This device has a 45W power budget, period.”
For automotive, power comes from the vehicle battery. For IoT, from solar/battery. For retail, from local power with cooling constraints.
This forces architectural decisions:
- Model compression (smaller = less energy)
- Quantization (INT8 arithmetic costs a fraction of the energy of FP32 ops)
- Sparse models (only activate needed pathways)
- Custom silicon (ASICs more efficient than GPUs)
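The fixed power envelope translates directly into a throughput ceiling. A sketch of that relationship - the joules-per-inference numbers are illustrative, not measurements from any particular model:

```python
# A fixed power budget caps throughput: watts / (joules per inference)
# = inferences per second. Energy figures below are illustrative only.

def max_inferences_per_sec(power_budget_w: float, joules_per_inference: float) -> float:
    return power_budget_w / joules_per_inference

# At the same 45 W envelope, a heavier model directly costs throughput:
print(max_inferences_per_sec(45, 0.5))   # lighter model
print(max_inferences_per_sec(45, 3.0))   # heavier model
```

This is why compression, sparsity, and custom silicon aren't optional at the edge: every joule saved per inference is throughput you get back under the same power cap.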
Google’s Edge TPU, Apple’s Neural Engine, Tesla’s FSD chip - all designed for efficient on-device inference.
The Question: Will Edge LLMs Ever Match Cloud?
My controversial take: No, and that’s okay.
Edge and cloud serve different purposes:
Cloud LLMs: General purpose, maximum capability, high cost
Edge LLMs: Specialized, low latency, cost-effective at scale
The future is hybrid:
- Edge handles 90% of queries (fast, cheap, private)
- Cloud handles 10% complex queries (slow, expensive, powerful)
- Seamless handoff between them
This matches user expectations: most interactions are simple (“set timer for 10 minutes”), some are complex (“explain quantum entanglement”).
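In practice the handoff starts with a cheap router in front of both models. A toy sketch of the idea - production routers are usually small trained classifiers, and the keyword list and threshold here are invented for illustration:

```python
# Toy hybrid router: short, known-intent queries stay on the edge model;
# everything else escalates to the cloud. Keywords/threshold are invented.

EDGE_KEYWORDS = {"timer", "lights", "volume", "weather", "play"}

def route(query: str, max_edge_words: int = 20) -> str:
    words = query.lower().split()
    if len(words) <= max_edge_words and EDGE_KEYWORDS & set(words):
        return "edge"    # simple, recognized intent: handle locally
    return "cloud"       # long or open-ended: escalate

print(route("set a timer for 10 minutes"))    # handled at the edge
print(route("explain quantum entanglement"))  # escalated to the cloud
```

Even a crude router like this captures the 90/10 split, because most real traffic really is short, repetitive, and intent-shaped.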
What are others seeing with edge AI deployments? Where is LLM compression good enough vs where do you need cloud?