10 posts tagged with "performance"

The Cold Start Tax on Serverless AI Agents

· 11 min read
Tian Pan
Software Engineer

A standard Lambda function with a thin Python handler cold-starts in about 250ms. Your AI agent, calling the same runtime with a few SDK imports added, cold-starts in 8–12 seconds. Add local model inference and you're at 40–120 seconds. The first user to hit a scaled-down deployment waits the length of a TV commercial before the agent responds. That gap — not latency per inference token, not throughput, but the initial startup cost — is where most serverless AI deployments quietly fail their users.

The problem isn't unique to serverless, but serverless makes it visible. When you run agents on always-on infrastructure, you pay for idle capacity and cold starts never happen. When you embrace scale-to-zero to cut costs, every period of low traffic becomes a trap waiting for the next request.
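The dynamic is easy to reproduce without a cloud account. In the sketch below, a `sleep` stands in for a heavy dependency (importing a large SDK, loading model weights); the first invocation pays the cost and warm ones do not, which is exactly why scale-to-zero turns quiet periods into a trap. All names and timings here are illustrative:

```python
import time

# Toy illustration of import-dominated cold starts. `slow_import` is a
# stand-in for something like `import big_sdk` or loading model weights.
def slow_import():
    time.sleep(0.2)             # pretend this is the heavy import
    return object()

_sdk = None

def handler(event):
    global _sdk
    if _sdk is None:            # paid only on the first (cold) call
        _sdk = slow_import()
    return {"status": "ok"}

t0 = time.perf_counter()
handler({})
cold = time.perf_counter() - t0

t0 = time.perf_counter()
handler({})
warm = time.perf_counter() - t0

assert cold >= 0.2 and warm < 0.05  # cold pays the import; warm doesn't
```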

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway — just two seconds late and three times over-budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to what poisoned web applications in the 2010s. The good news: the solutions from that era port almost directly.
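To make the shape concrete, here is a toy version with hypothetical `get_order` and `get_orders_batch` tools over an in-memory store; the call counts, not the data, are the point:

```python
# Sketch of the N+1 shape in an agent's tool layer, using a toy store
# and a call counter in place of real network round-trips.
ORDERS = {i: {"id": i, "status": "shipped"} for i in range(12)}
calls = {"count": 0}

def get_order(order_id):
    calls["count"] += 1          # one round-trip per order
    return ORDERS[order_id]

def get_orders_batch(order_ids):
    calls["count"] += 1          # one round-trip for all orders
    return [ORDERS[i] for i in order_ids]

ids = list(ORDERS)

calls["count"] = 0
naive = [get_order(i) for i in ids]   # the N+1 pattern
n_plus_one_calls = calls["count"]

calls["count"] = 0
batched = get_orders_batch(ids)       # the batched fix
batch_calls = calls["count"]

assert naive == batched
print(n_plus_one_calls, batch_calls)  # 12 1
```

Same answer either way; the only difference is twelve round-trips of latency and token overhead versus one.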

The Context Stuffing Antipattern: Why More Context Makes LLMs Worse

· 9 min read
Tian Pan
Software Engineer

When 1M-token context windows shipped, many teams took it as permission to stop thinking about context design. The reasoning was intuitive: if the model can see everything, just give it everything. Dump the document. Pass the full conversation history. Forward every tool output to the next agent call. Let the model sort it out.

This is the context stuffing antipattern, and it produces a characteristic failure mode: systems that work fine in early demos, then hit a reliability ceiling in production that no amount of prompt tweaking seems to fix. Accuracy degrades on questions that should be straightforward. Answers become hedged and non-committal. Agents start hallucinating joins between documents that aren't related. The model "saw" all the right information — it just couldn't find it.
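One hedge against stuffing is to select context rather than dump it. The sketch below ranks chunks by crude word overlap (a stand-in for a real retriever or embedding similarity) and keeps only what fits a word budget; everything here, from the scoring to the budget, is illustrative:

```python
# Naive alternative to dumping everything into the prompt: score chunks
# against the question and keep only what fits a budget.
def score(question, chunk):
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c)            # word overlap as a toy relevance score

def select_context(question, chunks, budget_words=10):
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())
        if used + n <= budget_words:   # stop stuffing past the budget
            picked.append(chunk)
            used += n
    return picked

chunks = [
    "the deploy failed because the token expired",
    "lunch menu for the offsite next week",
    "rotate the token in the secrets manager to fix deploys",
]
ctx = select_context("why did the deploy fail", chunks, budget_words=10)
assert "lunch" not in " ".join(ctx)   # irrelevant chunk never gets in
```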

Continuous Batching: The Single Biggest GPU Utilization Unlock for LLM Serving

· 11 min read
Tian Pan
Software Engineer

Most LLM serving infrastructure failures in production aren't model failures—they're scheduling failures. Teams stand up a capable model, load test it, and discover they're burning expensive GPU time at 35% utilization while users wait. The culprit is almost always static batching: a default inherited from conventional deep learning that fundamentally doesn't fit how language models generate text.

Continuous batching—also called iteration-level scheduling or in-flight batching—is the mechanism that fixes this. It's not a tuning knob; it's an architectural change to how the serving loop runs. The difference between a system using it and one that isn't can be 4–8x in throughput for the same hardware.
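A toy scheduler shows why. In the simulation below, one "step" is one decode iteration across the batch: static batching holds every slot until the longest request in the batch finishes, while the continuous scheduler admits a new request the moment a slot frees. The request lengths are made up, but the shape of the gap is the real phenomenon:

```python
# Toy iteration-level scheduler contrasting static vs continuous
# batching when output lengths vary. Lengths are in decode steps.
def static_batching(lengths, batch_size):
    """Admit a full batch, run until the longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching(lengths, batch_size):
    """Admit a new request the moment any slot frees up."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))     # refill freed slots
        steps += 1                          # one decode iteration
        active = [n - 1 for n in active if n > 1]
    return steps

# A few long generations pin every static batch to the stragglers.
lengths = [5, 100, 5, 5, 100, 5, 5, 5]
print(static_batching(lengths, 4))      # 200 steps
print(continuous_batching(lengths, 4))  # 105 steps
```

Here the short requests stop paying for the long ones, and the gap widens further as length variance grows.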

Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You

· 8 min read
Tian Pan
Software Engineer

Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.

Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.
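The distinction matters in code, too. A minimal semantic cache looks like the sketch below, with toy hand-made vectors in place of a real embedding model: `hit_rate` measures how often the cache answers at all, which is a different quantity from whether a given answer was the right match:

```python
import math

# Minimal semantic-cache sketch. Embeddings are toy 3-d vectors; a real
# system would call an embedding model and an ANN index.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []          # (embedding, answer) pairs
        self.lookups = 0
        self.hits = 0

    def get(self, emb):
        self.lookups += 1
        best, best_sim = None, -1.0
        for cached_emb, answer in self.entries:
            sim = cosine(emb, cached_emb)
            if sim > best_sim:
                best, best_sim = answer, sim
        if best is not None and best_sim >= self.threshold:
            self.hits += 1
            return best
        return None                # miss: fall through to the model

    def put(self, emb, answer):
        self.entries.append((emb, answer))

    @property
    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

cache = SemanticCache(threshold=0.95)
cache.put((1.0, 0.0, 0.0), "42")
assert cache.get((0.99, 0.1, 0.0)) == "42"   # near-paraphrase: hit
assert cache.get((0.0, 1.0, 0.0)) is None    # unrelated query: miss
print(cache.hit_rate)  # 0.5
```

Note what `hit_rate` cannot tell you: whether that near-paraphrase actually deserved the cached answer. That judgment lives entirely in the threshold, which is where the wrong-answer failures come from.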

Speculative Execution in AI Pipelines: Cutting Latency by Betting on the Future

· 11 min read
Tian Pan
Software Engineer

Most LLM pipelines are embarrassingly sequential by accident. An agent calls a weather API, waits 300ms, calls a calendar API, waits another 300ms, calls a traffic API, waits again — then finally synthesizes an answer. That 900ms of total latency could have been 300ms if those three calls had run in parallel. Nobody designed the system to be sequential; it just fell out naturally from writing async calls one after another.

Speculative execution is the umbrella term for a family of techniques that cut perceived latency by doing work before you know you need it — running parallel hypotheses, pre-fetching likely next steps, and generating multiple candidate outputs simultaneously. These techniques borrow directly from CPU design, where processors have speculatively executed future instructions since the 1990s. Applied to AI pipelines, the same instinct — commit to likely outcomes, cancel the losers, accept the occasional waste — can produce dramatic speedups. But the coordination overhead can also swallow the gains whole if you're not careful about when to apply them.
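The simplest member of that family is plain fan-out. Using `asyncio.gather` with sleep-based stand-ins for the three tool calls, the parallel version finishes in roughly one call's latency instead of three:

```python
import asyncio
import time

# The three "tools" just sleep; real calls would be weather/calendar/
# traffic APIs. Delays are shrunk from 300ms to keep the demo quick.
async def fake_tool(name, delay=0.1):
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def sequential():
    a = await fake_tool("weather")    # accidental serialization:
    b = await fake_tool("calendar")   # each await blocks the next call
    c = await fake_tool("traffic")
    return [a, b, c]

async def parallel():
    return list(await asyncio.gather(
        fake_tool("weather"), fake_tool("calendar"), fake_tool("traffic")))

start = time.perf_counter()
seq = asyncio.run(sequential())
seq_t = time.perf_counter() - start   # ~3x the single-call delay

start = time.perf_counter()
par = asyncio.run(parallel())
par_t = time.perf_counter() - start   # ~1x the single-call delay

assert seq == par
assert par_t < seq_t
```

True speculation goes one step further: launch calls you only *might* need and cancel the losers, which is where the bookkeeping and wasted-work tradeoffs begin.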

Load Testing LLM Applications: Why k6 and Locust Lie to You

· 11 min read
Tian Pan
Software Engineer

You ran your load test. k6 reported 200ms average latency, 99th percentile under 800ms, zero errors at 50 concurrent users. You shipped to production. Within a week, users were reporting 8-second hangs, dropped connections, and token budget exhaustion mid-stream. What happened?

The test passed because you measured the wrong things. Conventional load testing tools were designed for stateless HTTP endpoints that return a complete response in milliseconds. LLM APIs behave like nothing those tools were built to model: they stream tokens over seconds, charge by the token rather than the request, saturate GPU memory rather than CPU threads, and respond completely differently depending on whether a cache is warm. A k6 script that hammer-tests your /chat/completions endpoint will produce numbers that look like performance data but contain almost no signal about what production actually looks like.
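Measuring a streaming endpoint properly means timestamping every token, not just the whole response. The sketch below does that against a fake token generator, where sleeps stand in for prefill and per-token decode; the same `measure` loop is the shape you would run over a real token stream:

```python
import time

# Streaming-aware measurement: record time-to-first-token (TTFT) and
# inter-token gaps separately instead of one total-latency number.
def fake_stream(n_tokens=5, ttft=0.2, gap=0.02):
    time.sleep(ttft)            # prefill / queueing before first token
    for i in range(n_tokens):
        if i:
            time.sleep(gap)     # decode time per subsequent token
        yield f"tok{i}"

def measure(stream):
    start = time.perf_counter()
    first, stamps = None, []
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now - start          # TTFT
        stamps.append(now)
    total = stamps[-1] - start
    return {"ttft": first, "total": total,
            "tokens_per_s": len(stamps) / total}

m = measure(fake_stream())
assert m["ttft"] >= 0.2                      # dominated by prefill
assert m["ttft"] > m["total"] - m["ttft"]    # TTFT dwarfs decode time
```

A request-level averaging tool would report only `total` and happily hide that most of the wait happened before the first byte.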

LLM Latency Decomposition: Why TTFT and Throughput Are Different Problems

· 11 min read
Tian Pan
Software Engineer

Most engineers building on LLMs treat latency as a single dial. They tune something — a batch size, a quantization level, an instance type — observe whether "it got faster," and call it done. This works until you hit production and discover that your p50 TTFT looks fine while your p99 is over 3 seconds, or that the optimization that doubled your throughput somehow made individual users feel the system got slower.

TTFT and throughput are not two ends of the same slider. They are caused by fundamentally different physics, degraded by different bottlenecks, and fixed by different techniques. Treating them as interchangeable is the root cause of most LLM inference incidents I've seen in production.
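A back-of-envelope model makes the split visible: prefill (and therefore TTFT) is compute-bound, while decode throughput is bound by how fast the GPU can read the model weights each step. The numbers below are illustrative, not measurements of any particular GPU or model:

```python
# Back-of-envelope latency decomposition. Assumptions: prefill cost is
# flops-limited; each decode step reads the full weights once, so decode
# is memory-bandwidth-limited and batching amortizes the weight reads.
def ttft_s(prompt_tokens, flops_per_token, gpu_flops, queue_s=0.0):
    """TTFT ~ queueing + prefill compute time."""
    return queue_s + prompt_tokens * flops_per_token / gpu_flops

def decode_tokens_per_s(model_bytes, mem_bw_bytes_per_s, batch=1):
    """Aggregate decode rate: batch tokens per weight read."""
    return batch * (mem_bw_bytes_per_s / model_bytes)

# A ~7B model in fp16 (~14 GB, ~2 flops/param/token) on a GPU with
# ~300 Tflops of compute and ~1 TB/s of memory bandwidth:
prefill = ttft_s(2000, 14e9, 300e12)           # ~0.09 s for a 2k prompt
tps1 = decode_tokens_per_s(14e9, 1e12, batch=1)  # ~71 tok/s
tps8 = decode_tokens_per_s(14e9, 1e12, batch=8)  # ~571 tok/s aggregate

assert tps8 == 8 * tps1
# Throughput scales with batch size, but each user's decode speed is
# unchanged, and TTFT only grows as queueing depth increases.
```

Different physics, different knobs: faster prefill wants compute (or shorter prompts); faster decode wants bandwidth (or quantized weights); bigger batches buy aggregate throughput at the cost of queueing-driven tail TTFT.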

LLM Latency in Production: What Actually Moves the Needle

· 10 min read
Tian Pan
Software Engineer

Most LLM latency advice falls into one of two failure modes: it focuses on the wrong metric, or it recommends optimizations that are too hardware-specific to apply unless you're running your own inference cluster. If you're building on top of a hosted API or a managed inference provider, a lot of that advice is noise.

This post focuses on what actually moves the needle — techniques that apply whether you control the stack or not, grounded in production data rather than benchmark lab conditions.

Three Skills to Boost Team Performance

· 3 min read

Teamwork is crucial. Even a genius like Turing needed help from others to crack the Enigma. So what enables a team to succeed? People naturally assume it is the ability of the individual members, but the reality might surprise you.

At the beginning of The Culture Code, the author describes an interesting competition among kindergarten children, business school students, and lawyers: participants had to build the tallest structure possible using raw spaghetti, tape, string, and marshmallows. The kindergarten children won. Why did the seemingly least capable group defeat the others? Reviewing the competition, the author found that business school students typically analyzed the problem first, discussed the right strategy, and then quietly established a hierarchy, whereas the kindergarten children simply started building and experimenting with different approaches.

A strong team culture emphasizes communication among team members rather than individual skill, and it is this culture that maximizes overall performance. Three key skills help foster it.

1. Create a Safe Work Environment

People are more likely to unleash their full potential in a familiar environment, so creating a safe space is crucial. The sense of safety within a team comes from familiarity and connection among its members. If you want to cultivate a safe work environment, it is essential to learn to listen and let others know they are heard. When people know that what they say is being listened to and valued, they feel secure. You can provide appropriate feedback while listening, which enhances interaction and makes people feel needed.

2. Be Vulnerable to Build Trust

Although it may seem counterintuitive, showing vulnerability can indeed enhance team performance. We often observe the behaviors of those around us and learn by imitation. Admitting your weaknesses and mistakes to team members shows that they can do the same. This helps to strengthen internal trust among the team.

At the same time, displaying your shortcomings expresses an expectation for collaboration. When you show that you rely on others for help, they can also comfortably acknowledge their need for assistance. Over time, everyone understands that they shouldn’t bear everything alone, naturally fostering a sense of unity within the team.

3. Establish Common Goals and Reinforce Them

A steadfast pursuit of common goals is key to good team performance. A team's common goal refers to the beliefs and values that motivate the actions of its members. This common goal clarifies the team's self-identity and communicates it to the outside world. Psychologist Gabriele Oettingen has demonstrated through several studies that communicating common goals helps unite members and makes achieving those goals easier.

Repetition is what makes a goal stick. If something matters, repeating it ten or even a hundred times is worthwhile: restate the company's mission in meetings, or turn the goals into catchy slogans.