4 posts tagged with "async"

Hot-Path vs. Cold-Path AI: The Architectural Decision That Decides Your p99

· 10 min read
Tian Pan
Software Engineer

Every AI feature you ship makes an architectural choice before it makes a product one: does this model call live inside the user's request, or does it run somewhere the user isn't waiting for it? The choice is usually made by whoever writes the first prototype, never revisited, and silently determines your p99 latency for the rest of the feature's life. When the post-mortem asks why a shipping dashboard became unusable at 10 a.m. every Monday, the answer is almost always that something which should have been cold-path got welded into the hot path — and a model that is fine at p50 becomes catastrophic at p99 when traffic fans out.

The hot-path / cold-path distinction is older than LLMs. CQRS, streaming architectures, lambda architectures — they all draw the same line between "must respond now" and "can arrive eventually." What's different about AI workloads is that the cost of crossing the line in the wrong direction is an order of magnitude higher than it used to be. A synchronous database query that takes 50 ms turning into 200 ms is a regression. A synchronous LLM call that takes 1.2 s at p50 turning into 11 s at p99 is a business decision you didn't know you made.

Structured Concurrency for AI Pipelines: Why asyncio.gather() Isn't Enough

· 9 min read
Tian Pan
Software Engineer

When an LLM returns three tool calls in a single response, the obvious thing is to run them in parallel. You reach for asyncio.gather(), fan the calls out, collect the results, return them to the model. The code works in testing. It works in staging. Six weeks into production, you start noticing your application holding open HTTP connections it should have released. Token quota is draining faster than usage metrics suggest. Occasionally, a tool that sends an email fires twice.

The underlying issue is not the LLM or the tool — it's the concurrency primitive. asyncio.gather() was not designed for the failure modes that multi-step agent pipelines produce, and using it as the backbone of parallel tool execution creates problems that are invisible until they compound.

Async Agent Workflows: Designing for Long-Running Tasks

· 10 min read
Tian Pan
Software Engineer

Most AI agent demos run inside a single HTTP request. The user sends a message, the agent reasons for a few seconds, the response comes back. Clean, simple, comprehensible. Then someone asks the agent to do something that takes eight minutes — run a test suite, draft a report from twenty web pages, process a batch of documents — and the whole architecture silently falls apart.

The 30-second wall is real. Cloud functions time out. Load balancers kill idle connections. Mobile clients go to sleep. None of the standard agent frameworks document what to do when your task outlives the transport layer. Most of them quietly fail.
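The standard escape from the 30-second wall is to decouple the task from the transport: acknowledge immediately with a task id, run the agent in the background, and let the client poll (or subscribe). The sketch below is a toy version of that pattern, under loud assumptions: the in-memory `jobs` dict stands in for a durable store, and the "endpoint" functions stand in for real routes.

```python
import asyncio
import uuid

jobs: dict[str, dict] = {}  # in production: Redis/Postgres, not process memory

async def long_agent_task(job_id: str) -> None:
    # Stand-in for an eight-minute agent run.
    await asyncio.sleep(0.05)
    jobs[job_id].update(status="done", result="report.txt")

async def start_task() -> str:
    # "POST /tasks": return an id immediately instead of holding the connection.
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "running"}
    # Keep a reference so the task isn't garbage-collected mid-run.
    jobs[job_id]["_task"] = asyncio.create_task(long_agent_task(job_id))
    return job_id

async def main() -> None:
    job_id = await start_task()
    assert jobs[job_id]["status"] == "running"  # the caller was never blocked
    while jobs[job_id]["status"] != "done":     # "GET /tasks/{id}": client polls
        await asyncio.sleep(0.01)
    print(jobs[job_id]["result"])

asyncio.run(main())
```

Because the client holds only an id, a dead load balancer, a slept phone, or a timed-out cloud function costs you a poll, not the task.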

Why Long-Running AI Agents Break in Production (And the Infrastructure to Fix It)

· 9 min read
Tian Pan
Software Engineer

Most AI agent demos work beautifully.

They run in under 30 seconds, hit three tools, and return a clean result. Then someone asks the agent to do something that actually matters — cross-reference a codebase, run a multi-stage data pipeline, process a batch of documents — and the whole thing falls apart in a cascade of timeouts, partial state, and duplicate side effects.

The problem is not the model. It is the infrastructure. Agents that run for minutes or hours face a completely different class of systems problems than agents that finish in seconds, and most teams hit this wall at the worst possible time: after they have already shipped something users depend on.
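One member of that class of problems, duplicate side effects under retry, has a standard mitigation: idempotency keys. The sketch below is illustrative only (the key format, `send_email`, and the in-memory `completed` set are made up; a real system would persist completion records durably). A retried step checks whether its key already completed before re-running the effect.

```python
import asyncio

sent_emails: list[str] = []
completed: set[str] = set()  # in production: a durable store, not process memory

async def send_email(to: str) -> None:
    sent_emails.append(to)   # the side effect we must not duplicate

async def run_step(idempotency_key: str, to: str) -> None:
    if idempotency_key in completed:
        return               # a retry of an already-finished step is a no-op
    await send_email(to)
    completed.add(idempotency_key)  # recorded after the effect: at-least-once,
                                    # deduplicated on the happy path

async def main() -> None:
    # A timeout-driven retry re-executes the step with the same key...
    await run_step("job-42/step-3", "user@example.com")
    await run_step("job-42/step-3", "user@example.com")
    print(len(sent_emails))  # ...but the email goes out once

asyncio.run(main())
```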