Skip to main content

3 posts tagged with "llm-architecture"

View all tags

The Latency Budget Negotiation: How to Tell Product That 'Real-Time' Costs Capability

· 11 min read
Tian Pan
Software Engineer

A product manager walks into a planning meeting with a one-line requirement: "responses under two seconds, like ChatGPT." The agent under discussion makes six tool calls, hits two retrieval indexes, runs a reasoning model with a thinking budget, and validates its output with a second-pass critic. End-to-end p50 is currently nine seconds. The engineering team has three options: say yes and quietly degrade the agent into something worse, say no and watch the PM go shopping for a vendor whose demo video promises the moon, or do the thing nobody teaches in onboarding — open a structured negotiation where every second of latency is convertible to a capability the agent gives up.

Most teams pick option one. The agent ships at two seconds, accuracy drops twelve points, the launch is called a success because the headline latency number was met, and three months later the team is fighting a quality regression that nobody can attribute to a single change because the regression was the launch itself. The latency target was never priced. It was inherited from a product spec that treated speed as free.

Your Tool Catalog Is a Power Law and You're Optimizing the Long Tail

· 11 min read
Tian Pan
Software Engineer

Pull a week of tool-call traces from any production agent and the shape is the same: three or four tools handle 90% of the calls, and a couple of dozen others split the remaining 10%. The catalog is a power law, but the framework treats it like a uniform list. Every tool description ships in every system prompt, every selection rubric weights tools equally, every eval samples the catalog as if a search-files call and a refund-issue call were drawn from the same distribution. They are not.

The cost of that flatness is invisible until it isn't. A team adds the eighteenth tool, the planner's accuracy on the original three drops two points, nobody can localize the regression to a specific change because everything moved at once, and the eval suite — itself uniform across the catalog — averages the slip into a number that still looks fine. Meanwhile the tokens spent describing tools the model will not call this turn now exceed the tokens spent on the user's actual prompt.

The 10x Prompt Engineer Myth: Why System Design Beats Prompt Wordsmithing

· 8 min read
Tian Pan
Software Engineer

There is a persistent belief in the AI engineering world that the difference between a mediocre LLM application and a great one comes down to prompt craftsmanship. Teams hire "prompt engineers," run dozens of A/B tests on phrasing, and spend weeks agonizing over whether "You must" outperforms "Please ensure." Meanwhile, the retrieval pipeline feeds garbage context, there is no output validation, and the error handling strategy is "hope the model gets it right."

The data tells a different story. The first five hours of prompt work on a typical LLM application yield roughly a 35% improvement. The next twenty hours deliver 5%. The next forty hours? About 1%. Teams that recognize this curve early and redirect effort into system design consistently outperform teams that keep polishing prompts.