Skip to main content

4 posts tagged with "llm-architecture"

View all tags

The Conversation Tree Your Server Stored As A Log

· 10 min read
Tian Pan
Software Engineer

A user types "actually, I meant fifty, not fifteen," hits the pencil icon on their last message, and edits it. The UI does what good UIs do: it shows them the corrected message, fades out the old one, scrolls the assistant's stale reply into a struck-through ghost, and presents a clean conversation that reads as if the original mistake never happened. The user, satisfied, sends the next turn. The agent answers using fifteen.

The bug is not in the model. The model received exactly what the server sent it, and the server sent it the original message, the original assistant response, the regret, the edited message, and the new request — all concatenated, all in order, all live. The user is having a conversation they edited. The agent is having a conversation that was never edited. The two transcripts diverge at turn three and never reconcile, and every subsequent turn pays interest on the gap.

The Latency Budget Negotiation: How to Tell Product That 'Real-Time' Costs Capability

· 11 min read
Tian Pan
Software Engineer

A product manager walks into a planning meeting with a one-line requirement: "responses under two seconds, like ChatGPT." The agent under discussion makes six tool calls, hits two retrieval indexes, runs a reasoning model with a thinking budget, and validates its output with a second-pass critic. End-to-end p50 is currently nine seconds. The engineering team has three options: say yes and quietly degrade the agent into something worse, say no and watch the PM go shopping for a vendor whose demo video promises the moon, or do the thing nobody teaches in onboarding — open a structured negotiation where every second of latency is convertible to a capability the agent gives up.

Most teams pick option one. The agent ships at two seconds, accuracy drops twelve points, the launch is called a success because the headline latency number was met, and three months later the team is fighting a quality regression that nobody can attribute to a single change because the regression was the launch itself. The latency target was never priced. It was inherited from a product spec that treated speed as free.

Your Tool Catalog Is a Power Law and You're Optimizing the Long Tail

· 11 min read
Tian Pan
Software Engineer

Pull a week of tool-call traces from any production agent and the shape is the same: three or four tools handle 90% of the calls, and a couple of dozen others split the remaining 10%. The catalog is a power law, but the framework treats it like a uniform list. Every tool description ships in every system prompt, every selection rubric weights tools equally, every eval samples the catalog as if a search-files call and a refund-issue call were drawn from the same distribution. They are not.

The cost of that flatness is invisible until it isn't. A team adds the eighteenth tool, the planner's accuracy on the original three drops two points, nobody can localize the regression to a specific change because everything moved at once, and the eval suite — itself uniform across the catalog — averages the slip into a number that still looks fine. Meanwhile the tokens spent describing tools the model will not call this turn now exceed the tokens spent on the user's actual prompt.

The 10x Prompt Engineer Myth: Why System Design Beats Prompt Wordsmithing

· 8 min read
Tian Pan
Software Engineer

There is a persistent belief in the AI engineering world that the difference between a mediocre LLM application and a great one comes down to prompt craftsmanship. Teams hire "prompt engineers," run dozens of A/B tests on phrasing, and spend weeks agonizing over whether "You must" outperforms "Please ensure." Meanwhile, the retrieval pipeline feeds garbage context, there is no output validation, and the error handling strategy is "hope the model gets it right."

The data tells a different story. The first five hours of prompt work on a typical LLM application yield roughly a 35% improvement. The next twenty hours deliver 5%. The next forty hours? About 1%. Teams that recognize this curve early and redirect effort into system design consistently outperform teams that keep polishing prompts.