MoE Models in Production: The Serving Quirks Dense-Model Benchmarks Hide

· 10 min read
Tian Pan
Software Engineer

Benchmarks told you Mixtral 8x7B costs about as much to run as a 13B dense model. What they didn't tell you is that all ~47B of its parameters still have to be resident in GPU memory, roughly 3.6× the footprint of a dense model with the same per-token compute; that latency swings wildly depending on which tokens route to which experts; and that throughput falls apart at medium batch sizes in ways that take days to diagnose. Mixture-of-Experts architectures have become the backbone of nearly every frontier model (DeepSeek-V3, Llama 4, Gemini 1.5, Grok, Mistral Large), but the serving assumptions that hold for dense models break in subtle, expensive ways for MoE.
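The memory-versus-compute gap is easy to check on the back of an envelope. The sketch below counts Mixtral 8x7B's parameters from its published configuration (32 layers, hidden size 4096, SwiGLU FFN dim 14336, 8 experts with top-2 routing, grouped-query attention with a 1024-dim KV projection, 32k vocab); the exact breakdown is an estimate, not an official accounting:

```python
# Back-of-envelope parameter count for Mixtral 8x7B, using its published
# config values. Assumption: SwiGLU experts (3 weight matrices each) and
# untied input/output embeddings.
layers, d, ffn = 32, 4096, 14336
experts, topk = 8, 2          # 8 experts per layer, 2 active per token
kv_dim, vocab = 1024, 32000   # GQA key/value projection dim, vocab size

attn = layers * (2 * d * d + 2 * d * kv_dim)  # Wq, Wo, Wk, Wv
expert = 3 * d * ffn                          # w1, w2, w3 per expert
router = layers * d * experts                 # routing gate
embed = 2 * vocab * d                         # input + output embeddings

total = attn + layers * experts * expert + router + embed
active = attn + layers * topk * expert + router + embed

print(f"total  ≈ {total / 1e9:.1f}B params")    # ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B params")   # ≈ 12.9B
print(f"memory vs. compute-equivalent dense: {total / active:.1f}x")  # ≈ 3.6x
```

Every expert must be loaded even though only two fire per token, so the weights alone occupy the footprint of the full ~47B model while the per-token FLOPs match a ~13B one.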

If you're planning to self-host or route traffic to any of these models, here's what dense-model intuition gets wrong.