3 posts tagged with "self-hosting"

The Build-vs-Buy LLM Infrastructure Decision Most Teams Get Wrong

Tian Pan · Software Engineer · 10 min read

A FinTech team built their AI chatbot on GPT-4o. Month one: $15K. Month two: $35K. Month three: $60K. Projecting $700K annually, they panicked and decided to self-host. Six months and one burned-out engineer later, they were paying $85K/month for infrastructure and a part-time DevOps engineer, and had weathered three CUDA incidents that took down production. They eventually landed at $8K/month, not by self-hosting everything, but by routing intelligently.

Both decisions were wrong. The real failure was that they never ran the actual math.
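
As a rough illustration of the math the post argues the team never ran, here is a minimal sketch comparing the three options using the figures above. The 90/10 routing split and the 5% relative per-request cost of the cheaper model are illustrative assumptions, not numbers from the post.

```python
# Back-of-the-envelope comparison using the monthly figures from the story.
# The routing split and the cheaper model's relative cost are assumptions.

API_COST_PER_MONTH = 60_000   # month-three GPT-4o spend
SELF_HOST_ALL_IN = 85_000     # infra + part-time DevOps, from the story

def routed_cost(api_monthly: float,
                cheap_fraction: float,
                cheap_cost_ratio: float) -> float:
    """Monthly cost if `cheap_fraction` of traffic moves to a model that
    costs `cheap_cost_ratio` of the frontier API per request."""
    expensive = api_monthly * (1 - cheap_fraction)
    cheap = api_monthly * cheap_fraction * cheap_cost_ratio
    return expensive + cheap

if __name__ == "__main__":
    print(f"All-API:        ${API_COST_PER_MONTH:>8,.0f}/mo")
    print(f"All self-host:  ${SELF_HOST_ALL_IN:>8,.0f}/mo")
    # e.g. 90% of requests handled by a model at ~5% of the per-request cost
    print(f"Routed (90/10): ${routed_cost(API_COST_PER_MONTH, 0.90, 0.05):>8,.0f}/mo")
```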

Open-Weight Models in Production: When Self-Hosting Actually Beats the API

Tian Pan · Software Engineer · 8 min read

Every few months, someone on your team forwards a blog post about how Llama or Qwen "matches GPT-4" on some benchmark, followed by the inevitable question: "Why are we paying for API calls when we could just run this ourselves?" The math looks compelling on a napkin. The reality is that most teams that attempt self-hosting end up spending more than they save, not because the models are bad, but because they underestimate everything that isn't the model.

That said, there are specific situations where self-hosting open-weight models is the clearly correct decision. The trick is knowing which situation you're actually in, rather than the one you wish you were in.
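
For reference, the napkin version of that math usually looks something like the sketch below. Every price in it (the GPU-hour rate, the blended API price per million tokens) is an assumption for illustration, and the post's point is that this calculation leaves out most of the real cost.

```python
# Napkin math: the monthly token volume at which a dedicated GPU box
# appears to beat per-token API pricing. All prices are assumed.

GPU_MONTHLY_COST = 8 * 730 * 4.0   # e.g. 8 GPUs, ~730 hrs/month, $4/GPU-hr
API_PRICE_PER_M_TOKENS = 10.0      # assumed blended $/1M tokens

break_even_m_tokens = GPU_MONTHLY_COST / API_PRICE_PER_M_TOKENS
print(f"Hardware: ${GPU_MONTHLY_COST:,.0f}/mo")
print(f"Break-even volume: {break_even_m_tokens:,.0f}M tokens/month")
# This ignores engineering time, redundancy, and utilization --
# everything that isn't the model.
```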

Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You

Tian Pan · Software Engineer · 10 min read

Most engineers who decide to self-host an LLM start with the same calculation: the model is 70B parameters, FP16 is 2 bytes per parameter, so that's 140 GB. They check that two A100-80GB GPUs give them 160 GB, feel satisfied, and order the hardware. Then they hit production and discover they've run out of memory before serving a single real user.

The model weights are only part of the story. The piece that surprises almost every team is the KV cache — and understanding it changes every decision you make, from quantization choice to serving framework to how many GPUs you actually need.
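
To make the weights-only calculation concrete, and to show where the KV cache breaks it, here is a minimal sketch. The layer count, KV-head count, and head dimension are assumptions based on a Llama-3-70B-style grouped-query-attention layout, not figures from the post.

```python
# Minimal sketch of the memory math: weights plus KV cache.
# The 70B config below (layers, KV heads, head dim) assumes a
# Llama-3-70B-style GQA layout; substitute your model's actual config.

PARAMS = 70e9
BYTES_PER_PARAM = 2                  # FP16 weights
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BYTES = 2                         # FP16 keys and values

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# KV cache per token: keys + values, across all layers and KV heads
kv_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES   # bytes

def kv_cache_gb(context_len: int, concurrent_seqs: int) -> float:
    return kv_per_token * context_len * concurrent_seqs / 1e9

print(f"Weights:               {weights_gb:.0f} GB")
print(f"KV cache, 1 x 8K ctx:  {kv_cache_gb(8192, 1):.1f} GB")
print(f"KV cache, 32 x 8K ctx: {kv_cache_gb(8192, 32):.1f} GB")
# Weights alone: 140 GB. Add 32 concurrent 8K-token sequences (~86 GB of
# KV cache) and two 80 GB A100s are long gone, before activations and
# framework overhead are even counted.
```

Even at modest concurrency, the KV cache can rival the weights themselves, which is why it reshapes the quantization, serving-framework, and GPU-count decisions the post walks through.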