The 200-Token System Prompt That Beats Your 4000-Token One
A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel — a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. It was also forty times cheaper to run, and noticeably faster.
This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.
