
Pre-Commit Hooks for Prompts: The Inner-Loop Tooling LLM Teams Keep Shipping Without

10 min read
Tian Pan
Software Engineer

Open a prompt file in any production LLM repo and watch the reviewer's eyes glaze over. The diff is fifteen lines of natural language with a tweaked few-shot example, a reworded instruction, and a stray trailing space the editor left behind. No syntax check ran on it, no linter complained about contradictory instructions, no scanner noticed that the few-shot example contains a real customer's email address from last Tuesday's support trace, and no smoke eval confirmed the change didn't tank latency on the prompts the system actually serves. The reviewer approves on vibes — the same way teams approved HTML template diffs in 2008 — and then production telemetry catches the regression six hours later.

The inner-loop tooling around code has had two decades to mature. The inner-loop tooling around prompts is somewhere between "we have a .md file in git" and "we ran promptfoo once after onboarding." The gap is widening because prompts are now the higher-leverage edit in many systems: a thirty-line system-prompt change moves more behavior than a thousand-line service rewrite, and it ships through a review process that treats it like a Word document.

This post is about the pre-commit toolchain that has to land before prompt review becomes a discipline rather than a vibe check. Five hooks, in priority order, each catching a specific failure mode that has already burned someone's quarter.

The Formatter: Stop Whitespace From Invalidating Your Cache

Prompt caching is byte-for-byte prefix matching. A single trailing space added by an editor's "trim on save" setting, a markdown heading that got reformatted from ##Section to ## Section, a JSON tool schema whose key order shifted because someone edited the file in a different IDE — any of these invalidates the cached prefix and the next thousand requests pay full token cost instead of cached cost.

The teams that have measured this find cache hit rates collapsing by twenty to forty percent for entire categories of requests, traceable back to a single-character whitespace edit that no human reviewer flagged because no human reviewer can see whitespace in a diff. The fix is mechanical: a prompt-aware formatter that runs in pre-commit, normalizes whitespace, enforces a consistent heading style, sorts JSON tool schema keys deterministically, and refuses to commit a file whose cache-relevant bytes drifted without an explanation.
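
A minimal sketch of that formatter, written as a standalone pre-commit script; the assumed file layout (markdown prompt files plus JSON tool schemas) and the specific normalization rules are illustrative choices to adapt, not any particular tool's behavior:

```python
#!/usr/bin/env python3
"""Prompt formatter: normalize cache-relevant bytes before they reach a commit.

Illustrative sketch -- file layout and conventions are assumptions, not a
specific tool's API.
"""
import json
import re
import sys
from pathlib import Path


def normalize_prompt_text(text: str) -> str:
    # Strip trailing whitespace that editors add on save.
    lines = [line.rstrip() for line in text.splitlines()]
    # Enforce one canonical markdown heading style: "## Section", not "##Section".
    lines = [re.sub(r"^(#+)(\S)", r"\1 \2", line) for line in lines]
    # Collapse runs of blank lines to a single blank line.
    out, blank = [], False
    for line in lines:
        if line == "":
            if not blank:
                out.append(line)
            blank = True
        else:
            out.append(line)
            blank = False
    return "\n".join(out).rstrip() + "\n"


def normalize_tool_schema(text: str) -> str:
    # Deterministic key order and indentation so an IDE re-save can't
    # shuffle bytes and silently invalidate the cached prefix.
    return json.dumps(json.loads(text), indent=2, sort_keys=True) + "\n"


def main(paths: list[str]) -> int:
    changed = []
    for raw in paths:
        path = Path(raw)
        original = path.read_text(encoding="utf-8")
        formatted = (
            normalize_tool_schema(original)
            if path.suffix == ".json"
            else normalize_prompt_text(original)
        )
        if formatted != original:
            path.write_text(formatted, encoding="utf-8")
            changed.append(raw)
    if changed:
        print("reformatted (re-stage and commit again):", ", ".join(changed))
        return 1  # non-zero exit blocks the commit
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

Registered as a local pre-commit hook, it receives the staged prompt files as arguments and the non-zero exit blocks the commit until the reformatted files are re-staged, the same ergonomics teams already know from black or prettier.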

The formatter should also stabilize cache breakpoints. If your prompt structure has explicit cache markers — cache_control on the system prompt, on the tool definitions, on the first few-shot block — the formatter knows where the cache boundaries are and can warn when a diff straddles one. A change inside a cached region that should have been outside it is the most common source of silent cache-miss surges, and a formatter that treats cache boundaries as a first-class concept catches it before review.
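
A sketch of that boundary check, assuming the prompt loader records its cache breakpoints as character offsets in a small sidecar file; the sidecar name and format here are invented for illustration:

```python
"""Warn when a staged prompt edit falls inside a cached prefix.

Sketch under an assumption: the prompt loader writes its cache breakpoints to
a sidecar JSON file as character offsets, so the hook and the loader share one
model of the cache structure.
"""
import json
import subprocess
import sys
from pathlib import Path


def first_divergence(old: str, new: str) -> int | None:
    """Character offset where the two versions stop sharing a prefix."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return None if len(old) == len(new) else min(len(old), len(new))


def cold_breakpoints(prompt_path: str, sidecar_path: str) -> list[str]:
    staged = Path(prompt_path).read_text(encoding="utf-8")
    proc = subprocess.run(
        ["git", "show", f"HEAD:{prompt_path}"], capture_output=True, text=True
    )
    if proc.returncode != 0:
        return []  # new file: nothing cached to invalidate yet
    cut = first_divergence(proc.stdout, staged)
    if cut is None:
        return []
    breakpoints = json.loads(Path(sidecar_path).read_text())
    # A breakpoint caches everything before its offset, so any breakpoint
    # past the first changed byte goes cold.
    return [bp["name"] for bp in breakpoints if bp["offset"] > cut]


if __name__ == "__main__":
    cold = cold_breakpoints(sys.argv[1], sys.argv[2])
    if cold:
        print("warning: edit straddles cached regions; breakpoints going cold:",
              ", ".join(cold))
```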

This is the boring hook. It is also the one with the fastest payback, because the cost of running it is microseconds and the cost of missing a cache invalidation is dollars per hour, every hour, until someone notices the bill.

The Linter: Contradictions, Drift, and Instruction Sprawl

Code linters catch unused variables, contradictory type annotations, and patterns the team has agreed not to use. Prompt linters need to catch the same class of problem in natural language: instructions that contradict each other within a single prompt, instructions in the system prompt that contradict the few-shot examples below them, instructions that duplicate a constraint already imposed by the tool schema, and instructions that drift across the prompt-fragment graph when one fragment changes and a sibling fragment doesn't.

The hard cases are not the obvious ones. "Always respond in JSON" followed three paragraphs later by "Explain your reasoning in plain text" is rare in production because someone catches it manually. The cases that slip through are subtler: a prompt that says "be concise" in the system section and then includes a five-shot example whose answers are five paragraphs long, a prompt that names a tool by an old name in the instructions and the new name in the schema, a prompt that says "do not use the word 'apologize'" while the few-shot examples model exactly that behavior. The model picks up the example, not the instruction, and the team blames the model.

A prompt linter that is useful in 2026 has to be partially LLM-driven. Pure regex catches the trivial drift; an LLM-as-judge running over the prompt with a rubric like "does any instruction contradict another instruction, an example, or the tool schema" catches the rest. Run it in pre-commit at low concurrency and low cost (a small model, on the changed prompt only, with a tight token budget), and let it block the commit when it returns a contradiction with high confidence. The point is not perfect catch rate — the point is to surface the contradictions to the author before review, so the reviewer is no longer doing this work in their head.
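
A sketch of what that judge call can look like, using the OpenAI Python SDK as a stand-in for whichever small model the team already runs; the model name, rubric wording, JSON contract, and confidence threshold are all assumptions to tune:

```python
"""LLM-as-judge contradiction check for a single changed prompt file.

Sketch only: keep the token budget tight, since this runs on every commit.
"""
import json
import sys
from pathlib import Path

from openai import OpenAI

RUBRIC = (
    "You are reviewing a single LLM prompt file. Report any instruction that "
    "contradicts another instruction, a few-shot example, or the tool schema. "
    'Respond with JSON: {"contradictions": [{"description": str, '
    '"confidence": "low"|"high"}]}.'
)


def lint(path: str) -> int:
    prompt_text = Path(path).read_text(encoding="utf-8")
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",      # any cheap model; swap for whatever you run
        temperature=0,
        max_tokens=500,           # tight budget: this is a gate, not a report
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": prompt_text},
        ],
    )
    findings = json.loads(resp.choices[0].message.content)["contradictions"]
    for f in findings:
        print(f"[{f['confidence']}] {f['description']}")
    # Block only on high-confidence contradictions; low-confidence ones inform.
    return 1 if any(f["confidence"] == "high" for f in findings) else 0


if __name__ == "__main__":
    sys.exit(max((lint(p) for p in sys.argv[1:]), default=0))
```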

The Secret-and-PII Scanner: The Few-Shot Example Problem

Source-code secret scanners look for API key patterns, password literals, AWS access key prefixes. They are tuned for code, not prose. A prompt file has a different threat model: the secret usually isn't an API key — it's a real customer message that an engineer pasted into a few-shot example block during the most recent prompt-tuning session, complete with the customer's email, their internal account ID, and the support context they thought was anonymized but wasn't.

This pattern is endemic to teams that tune prompts against real production traces. The tuning workflow is: copy a failing trace into a scratchpad, edit the few-shot example to demonstrate the desired behavior, paste the edited example back into the prompt file, commit. The customer's data is now in the repo, in git history, and — depending on your prompt-loading architecture — in every request that hits the model. The few-shot example may even end up in the prompt cache, fanned out to every workspace.

The pre-commit scanner that catches this is not a re-skin of gitleaks. It needs prompt-specific heuristics: detect what looks like a real user message (long, conversational, with names and timestamps) inside a few-shot block, detect what looks like a real email or phone number or account ID inside the prompt body rather than inside a placeholder template, detect when a {{user_input}} template has been replaced with a literal customer message. The scanner can be aggressive about false positives because the cost of a real leak is far higher than the cost of a manual override, and the override flow should require a justification comment that gets logged.
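
A sketch of those heuristics, tuned deliberately toward false positives; the regexes, the assumed few-shot delimiter, and the length threshold are illustrative starting points, not a vetted detector:

```python
"""Prompt-specific PII scan for few-shot blocks.

Aggressive by design: a false positive costs a manual override with a logged
justification; a false negative costs a data leak. Patterns and the assumed
"### Example" few-shot delimiter are guesses about this repo's layout.
"""
import re
import sys
from pathlib import Path

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{8,}\d")
ACCOUNT_ID = re.compile(r"\b(acct|cus|usr)[_-][A-Za-z0-9]{8,}\b")  # assumed ID shapes
TIMESTAMP = re.compile(r"\b20\d{2}-\d{2}-\d{2}[ T]\d{2}:\d{2}")
PLACEHOLDER = re.compile(r"\{\{\s*\w+\s*\}\}")  # {{user_input}}-style templates


def looks_like_real_message(block: str) -> bool:
    # A pasted production trace tends to be long, conversational, and dated;
    # a handwritten example tends to be short and placeholder-driven.
    return (
        len(block) > 400
        and TIMESTAMP.search(block) is not None
        and PLACEHOLDER.search(block) is None
    )


def scan(path: str) -> list[str]:
    text = Path(path).read_text(encoding="utf-8")
    findings = []
    for pattern, label in [(EMAIL, "email"), (PHONE, "phone number"),
                           (ACCOUNT_ID, "account id")]:
        for match in pattern.finditer(text):
            findings.append(f"{path}: possible {label}: {match.group(0)!r}")
    for block in re.split(r"^### Example", text, flags=re.MULTILINE)[1:]:
        if looks_like_real_message(block):
            findings.append(f"{path}: few-shot block looks like a pasted production trace")
    return findings


if __name__ == "__main__":
    problems = [f for p in sys.argv[1:] for f in scan(p)]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```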

For teams under GDPR or HIPAA scope, this hook is not optional — it is the only realistic way to prevent the failure mode where a well-intentioned engineer commits a real patient encounter into the prompt examples while iterating on a clinical assistant. The lawyers will not accept "we trained the engineers to be careful" as a control. They will accept a pre-commit hook with an audit log.

The Smoke Eval: Block the Commit, Not Just the PR

CI-level eval suites are now common: a prompt change opens a PR, promptfoo or an internal equivalent runs the full eval set, scores are posted as a check, the PR merges if the primary metric improves and no guardrail metric regresses past a threshold. This is good and necessary, and it is also too late.

The author has already context-switched away from the prompt by the time the CI eval finishes. The feedback loop is twenty minutes when the inner loop for code is ten seconds. The author iterates on the next thing, comes back to the eval report, has to rebuild context to understand why the regression happened, and either reverts or layers on a fix that compensates. None of this would happen if a ten-case smoke eval ran in pre-commit before the diff ever became a commit.

The pre-commit smoke eval is a different artifact from the CI eval. It is not the full set — it is a curated slice of ten to twenty cases that exercise the failure modes most likely to be touched by the kind of change this team commonly makes. Editing the refusal policy? The smoke eval includes a refusal-test slice. Tweaking the few-shot examples? The smoke eval includes the cases those examples are meant to teach. The eval runs in under a minute, ideally against a fast model or a deterministic fixture, and blocks the commit if any guardrail case regresses.
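
A sketch of such a smoke eval, assuming the curated cases live in a small JSONL file with string-level must_contain / must_not_contain checks so no judge model sits in the hot path; the file name, case format, and model are placeholders:

```python
"""Ten-to-twenty-case smoke eval meant to finish inside a minute at the keyboard.

Sketch under assumptions: each case in evals/smoke.jsonl has a name, a tag
("guardrail" or "quality"), an input, and simple substring checks.
"""
import json
import sys
from pathlib import Path

from openai import OpenAI

SMOKE_CASES = Path("evals/smoke.jsonl")  # curated slice, not the full CI set


def run_case(client: OpenAI, system_prompt: str, case: dict) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",             # fast model; the full suite still runs in CI
        temperature=0,
        max_tokens=300,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": case["input"]},
        ],
    )
    output = resp.choices[0].message.content or ""
    ok = all(s in output for s in case.get("must_contain", []))
    return ok and not any(s in output for s in case.get("must_not_contain", []))


def main(prompt_path: str) -> int:
    system_prompt = Path(prompt_path).read_text(encoding="utf-8")
    client = OpenAI()
    cases = [json.loads(line) for line in SMOKE_CASES.read_text().splitlines() if line]
    failed_guardrails = []
    for case in cases:
        if not run_case(client, system_prompt, case):
            print(f"FAIL [{case['tag']}] {case['name']}")
            if case["tag"] == "guardrail":
                failed_guardrails.append(case["name"])
    # Quality cases inform; guardrail cases block the commit.
    return 1 if failed_guardrails else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```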

Teams that have built this find the inner-loop muscle memory transfers almost immediately. The same engineer who would commit a typo and let CI catch it now waits for the smoke eval because the cycle is short enough that waiting is cheaper than rolling back. The reviewer's job shifts from "did this regress anything obvious" to "is the intent of this change correct," which is what reviewers should be doing anyway.

The Cache-Impact Estimator: The Hook Nobody Has Built Yet

The last hook in the priority stack does not exist as off-the-shelf software in May 2026, but the math for it is straightforward and the payoff is large. A cache-impact estimator takes the prompt diff, looks at the cache structure (which prefixes are cached at which breakpoints), and reports: "this edit invalidates the system-prompt cache breakpoint, which served roughly thirty percent of last week's requests; the cached portion above this edit will still hit; the new system prompt will take seven days to re-warm at current traffic." The output is one paragraph and a percentage.

The estimator catches a specific class of mistake: the engineer who edits a system prompt to fix one bug and doesn't realize their edit is above the cache breakpoint that serves the highest-volume request type. The fix lands, the bug is closed, and the bill for that workload doubles for the next week because every request is paying uncached cost. The hook does not need to block — a warning that says "this edit will invalidate X% of recent cache hits" is enough to make the engineer either restructure the edit to fall after the breakpoint, or move the breakpoint, or accept the cost knowingly rather than discover it on Friday.
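
A sketch of the estimate, building on the same breakpoint sidecar assumed in the formatter section plus one more invented input: a weekly traffic export mapping each breakpoint to its share of cache hits.

```python
"""Estimate how much cache traffic a staged prompt edit invalidates.

Sketch only. Reuses the breakpoint sidecar and prefix-divergence idea from the
formatter section; traffic.json is an assumed weekly export from telemetry.
"""
import json
import subprocess
import sys
from pathlib import Path


def first_divergence(old: str, new: str) -> int | None:
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return None if len(old) == len(new) else min(len(old), len(new))


def estimate(prompt_path: str, sidecar_path: str, traffic_path: str) -> None:
    staged = Path(prompt_path).read_text(encoding="utf-8")
    proc = subprocess.run(
        ["git", "show", f"HEAD:{prompt_path}"], capture_output=True, text=True
    )
    if proc.returncode != 0:
        return
    cut = first_divergence(proc.stdout, staged)
    if cut is None:
        return
    breakpoints = json.loads(Path(sidecar_path).read_text())
    traffic = json.loads(Path(traffic_path).read_text())  # {"system": 0.31, ...}
    cold = [bp["name"] for bp in breakpoints if bp["offset"] > cut]
    lost = sum(traffic.get(name, 0.0) for name in cold)
    if cold:
        print(
            f"warning: this edit (first changed byte at offset {cut}) invalidates "
            f"breakpoints {', '.join(cold)}, roughly {lost:.0%} of last week's "
            f"cache hits. Content above the edit still hits; the rest re-warms "
            f"at current traffic."
        )


if __name__ == "__main__":
    estimate(sys.argv[1], sys.argv[2], sys.argv[3])
```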

The reason this hook hasn't shipped widely yet is that it requires the prompt loader and the pre-commit toolchain to share a model of the cache structure. That coupling is annoying to build the first time. After it ships, it pays back every time someone touches the system prompt, which on most teams is every day.

The Cultural Artifact

The five hooks above are not the goal. The goal is that prompt changes get the same ten-second feedback loop reflex that code changes have. When the inner loop is fast and the tooling is doing the mechanical work, reviewers stop catching typos and cache-busters and start catching the actual question — does this prompt change move the system toward the intent the team wants? That is the question prompt review should be about, and it is the question that does not get asked when the reviewer is busy doing the linter's job.

The teams that ship prompts the way 2008 teams shipped HTML templates are going to keep losing review cycles, cache hit rates, and customer-data audits to the teams that treat the inner loop as the actual lever. The pre-commit hook is not a tooling preference. It is the line between prompt engineering as a discipline and prompt engineering as a folk practice.
