3 posts tagged with "scheduling"

The GPU Reservation Your Batch Workload Starved Your Real-Time Path On

· 9 min read
Tian Pan
Software Engineer

The nightly fine-tune job starts at 02:00 UTC. It walks into the shared GPU pool, takes every slot it can find, and holds them. By 09:30, when the first inference traffic of the business day arrives, the autoscaler tries to claim capacity that has been continuously occupied for seven and a half hours. The first ninety minutes of the morning run at roughly four times the baseline p99 latency. The dashboard reports a "noisy morning tail" that the inference team attributes to user behavior, because the actual contention lives in a job queue nobody on the inference team owns.

This is the GPU-sharing failure mode that the cost-attribution slide in your capacity review does not capture. The sharing was sold as a utilization win — train at night, serve in the day, fill the trough. What actually shipped was a latency tail you cannot escape until the pool is partitioned by latency class, not by team or by clock.
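As a rough illustration of partitioning by latency class, here is a minimal sketch of an admission check that keeps a floor of slots batch work can never take. `GpuPool`, `reserved_realtime`, and `try_acquire` are hypothetical names invented for this example, not part of any real scheduler, and a production system would express the same idea through its cluster scheduler's quota or priority mechanism.

```python
from dataclasses import dataclass


@dataclass
class GpuPool:
    """Toy model of a shared pool with a real-time reservation (hypothetical)."""
    total: int               # total GPU slots in the pool
    reserved_realtime: int   # slots that batch jobs may never occupy
    batch_used: int = 0
    realtime_used: int = 0

    def try_acquire(self, latency_class: str, n: int = 1) -> bool:
        """Grant n slots if the class is allowed to take them, else refuse."""
        free = self.total - self.batch_used - self.realtime_used
        if n > free:
            return False
        if latency_class == "batch":
            # The part of the reservation that real-time work is not using
            # yet stays off limits to batch, even when it is idle.
            reserve_left = max(0, self.reserved_realtime - self.realtime_used)
            if n > free - reserve_left:
                return False
            self.batch_used += n
        elif latency_class == "realtime":
            self.realtime_used += n
        else:
            raise ValueError(f"unknown latency class: {latency_class}")
        return True


if __name__ == "__main__":
    pool = GpuPool(total=8, reserved_realtime=3)
    granted = sum(pool.try_acquire("batch") for _ in range(8))
    print("batch slots granted:", granted)
    print("real-time can still get 3:", pool.try_acquire("realtime", 3))
```

The trade-off is deliberate: the reserved slots sit idle overnight, which costs utilization, in exchange for capacity the morning traffic does not have to wait for.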

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.
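One way to make that contract explicit is to pick a single authoritative timestamp per run and hand it to every layer, including the prompt. The sketch below is an assumption about how that could look, not the post's own implementation; `RunClock`, `prompt_header`, and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RunClock:
    """Timestamps captured once per run and passed to every layer (hypothetical)."""
    scheduled_at: datetime   # when the cron entry was meant to fire
    started_at: datetime     # when the worker actually began

    def prompt_header(self) -> str:
        # The scheduled time, not the model's training data, defines "today".
        today = self.scheduled_at.date()
        yesterday = today - timedelta(days=1)
        return (
            f"Current date (UTC): {today.isoformat()}. "
            f"'Yesterday' means {yesterday.isoformat()}."
        )


def build_prompt(clock: RunClock, notes: str) -> str:
    """Prefix the task with the run's date so the model never has to guess it."""
    return f"{clock.prompt_header()}\n\nSummarize these standup notes:\n{notes}"


if __name__ == "__main__":
    clock = RunClock(
        scheduled_at=datetime(2025, 3, 4, 9, 0, tzinfo=timezone.utc),
        started_at=datetime(2025, 3, 4, 9, 0, 2, tzinfo=timezone.utc),
    )
    print(build_prompt(clock, "- shipped the migration"))
```

The dates in the usage example are arbitrary. The point is only that the model is told which day it is summarizing instead of inferring it.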

Agent Fleet Concurrency: Coordinating Dozens of Agents Without Deadlock or the Thundering Herd

· 12 min read
Tian Pan
Software Engineer

Eleven agents started at the same second. Three died before the first tool call returned. That 27% fatality rate was not a model problem, a prompt problem, or a tool problem. It was a scheduling problem — the same kind of problem an operating system solves when fifty processes wake up at once and fight over a single CPU. The difference is that the OS has forty years of accumulated wisdom and the agent runtime has about two.

Anyone who has wired up more than a handful of concurrent LLM workers has seen some version of this. You kick off a scheduled job at 02:00, thirty agents spin up, they all hit the same provider within 200 ms of each other, and most of them fail with a mix of 429s, 502s, and connection resets. The survivors get half the rate budget they were promised because the provider's fair-share logic has already started throttling your API key. By 02:05 the surviving agents finish and your dashboard shows a completion rate that would embarrass a first-year CS student writing their first producer-consumer queue. Your on-call rotation debates whether to add retries, add a queue, or just run fewer agents.

None of those are the right answer by themselves. The right answer is that a fleet of agents is a small distributed system and needs to be designed like one.
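Two of the standard distributed-systems tools for this are jittered starts and a cap on in-flight calls. The following is a minimal sketch under stated assumptions: `run_fleet` and `jittered_delays` are hypothetical helpers, the provider call is simulated with `asyncio.sleep`, and the window and limit values are placeholders rather than recommendations.

```python
import asyncio
import random


def jittered_delays(n, window_s, seed=None):
    """Spread n start times uniformly at random over window_s seconds."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, window_s) for _ in range(n)]


async def run_fleet(n_agents, max_in_flight, window_s=0.05, work_s=0.01, seed=0):
    """Start n_agents with jitter and cap concurrent (simulated) provider calls."""
    gate = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0
    completed = 0

    async def agent(delay):
        nonlocal in_flight, peak, completed
        await asyncio.sleep(delay)        # jittered start, so agents do not wake together
        async with gate:                  # admission control on the shared provider
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_s)   # stand-in for the real provider call
            in_flight -= 1
        completed += 1

    delays = jittered_delays(n_agents, window_s, seed)
    await asyncio.gather(*(agent(d) for d in delays))
    return {"completed": completed, "peak_in_flight": peak}


if __name__ == "__main__":
    print(asyncio.run(run_fleet(n_agents=30, max_in_flight=5)))
```

Jitter alone only smooths the initial spike, and the semaphore alone still lets every agent queue at the same instant. Using both keeps the peak bounded regardless of how the start times happen to cluster.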