
Capability Probing: How to Map Your Model's Limitations Before Users Do

10 min read
Tian Pan
Software Engineer

Most teams discover their model's limitations the same way users do — in production, through support tickets. A customer reports the extraction pipeline silently dropping nested addresses. An internal user notices the summarizer hallucinating dates past 8,000 tokens. A compliance review finds the classifier confidently labeling ambiguous cases instead of abstaining.

None of these are surprises. They are capability boundaries that were always there, waiting for the right input to expose them. You either map those boundaries before deployment, or your users map them for you — one incident at a time.

The difference is cost: a probe failure in CI costs you five minutes. A capability gap discovered in production costs you a customer's trust. The discipline of finding those boundaries systematically is capability probing — fault injection for language models. You wouldn't ship a bridge without load-testing the joints. The same logic applies to any model you put in front of users.

Unlike standard evaluation — which measures average performance on expected tasks — capability probing deliberately targets where things break. You stress specific capabilities under controlled conditions, so failures surface in your test suite instead of your customer's workflow.

Why Benchmarks Give You a False Sense of Security

MMLU, HumanEval, and MATH scores tell you about general reasoning across standardized tasks. They say nothing about whether your model can parse the date formats in your invoices, handle your domain's abbreviations, or maintain accuracy when users switch languages mid-document.

A model scoring 90% on a general extraction benchmark might hit 40% on your specific document layout — tables with merged cells, footnotes spanning pages, headers in a non-Latin script. That gap isn't a model defect. It's a mismatch between what benchmarks measure and what your pipeline actually needs.

Benchmarks optimize for breadth. Production demands depth on a narrow task distribution. Failures cluster at the intersection of specific capabilities and unusual inputs — exactly the region benchmarks skip. Until you probe against your actual data distribution, benchmark scores are noise dressed as confidence.

Building a Capability Probe Suite

A probe suite is a structured collection of test cases designed to find failure boundaries rather than confirm expected behavior. Building one requires thinking adversarially about your own use case.

Start from failures, not successes. The best probe cases come from real production incidents, support tickets, and manual QA sessions. If a user reported that the model mishandled a specific input, that input belongs in your probe suite permanently. Aim for 20–50 failure cases before writing any evaluation infrastructure — small enough to curate carefully, large enough to reveal patterns across input types.

Decompose your task into atomic capabilities. A "document summarizer" actually requires several distinct capabilities: extracting key claims, preserving numerical precision, handling multi-section documents, recognizing when information is missing, and maintaining factual consistency. Each capability is a separate dimension to probe.

When you decompose this way, you often find that 80% of failures cluster in one or two capabilities you never explicitly tested.
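Finding those clusters is mechanical once each probe case is tagged with the capability it targets. A minimal sketch, assuming a hypothetical `ProbeCase` record and summarizer capabilities like the ones listed above:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ProbeCase:
    """One probe case, tagged with the atomic capability it targets."""
    capability: str   # e.g. "numerical_precision", "missing_info_detection"
    input_id: str     # pointer to the stored input document
    passed: bool      # result of the most recent probe run

def failure_clusters(cases):
    """Count failures per capability to reveal where problems concentrate."""
    return Counter(c.capability for c in cases if not c.passed)

# Hypothetical probe results for a document summarizer
cases = [
    ProbeCase("key_claims", "doc-1", True),
    ProbeCase("numerical_precision", "doc-2", False),
    ProbeCase("numerical_precision", "doc-3", False),
    ProbeCase("missing_info_detection", "doc-4", False),
]
print(failure_clusters(cases).most_common(1))
# -> [('numerical_precision', 2)]
```

The tagging itself is the real work; once it exists, the cluster report falls out of a one-line `Counter`.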

Test both positive and negative cases. For every capability, probe what the model should do and what it should not do. If your classifier should abstain on ambiguous inputs, test that it actually abstains — not just that it correctly classifies clear-cut cases. Unidirectional probes create a blind spot: the model scores well by being overconfident, because you never tested its ability to say "I don't know."
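A negative probe for abstention can be this small. The sketch below uses a stub in place of the real model call; the stub is deliberately overconfident so the probe has something to catch:

```python
def classify(text):
    """Stand-in for the real classifier call (assumption: replace with your model).
    Deliberately overconfident: it never abstains."""
    return "positive"

def probe_abstention(classifier, ambiguous_inputs, abstain_label="abstain"):
    """Negative probe: the classifier should abstain on every ambiguous input.
    Returns the inputs where it failed to say 'I don't know'."""
    return [x for x in ambiguous_inputs if classifier(x) != abstain_label]

ambiguous = ["it was fine, I guess", "hard to say either way"]
print(probe_abstention(classify, ambiguous))
# Non-empty list = the overconfidence blind spot, surfaced in a test
```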

Vary input characteristics systematically. For each capability, create test cases that vary along the dimensions most likely to cause degradation:

  • Input length (short, medium, at the context window boundary)
  • Language and script variations
  • Formatting irregularities (missing fields, unexpected delimiters, mixed encodings)
  • Domain-specific jargon and abbreviations
  • Adversarial phrasings that test robustness

The goal is not exhaustive coverage — it is finding the cliff edges where performance drops sharply.

The Capability Matrix: Making Boundaries Visible

Once you have probe results, organize them into a capability matrix. This is a grid where rows are specific capabilities and columns are input conditions. Each cell contains the model's pass rate for that combination.

A capability matrix for a document extraction system might look like this:

| Capability              | Standard Format | Scanned PDF | Handwritten | Multi-language |
|-------------------------|-----------------|-------------|-------------|----------------|
| Name extraction         | 97%             | 82%         | 41%         | 73%            |
| Date parsing            | 94%             | 78%         | 35%         | 68%            |
| Amount extraction       | 96%             | 85%         | 29%         | 71%            |
| Address parsing         | 88%             | 61%         | 22%         | 54%            |
| Missing field detection | 72%             | 48%         | 31%         | 44%            |
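In code, the matrix is just pass rates keyed by (capability, condition), and the worst cells can be pulled out mechanically. A sketch with a subset of the values above (the threshold is a hypothetical policy choice):

```python
# Pass rates keyed by (capability, condition), from probe results
matrix = {
    ("address_parsing", "scanned_pdf"): 0.61,
    ("address_parsing", "handwritten"): 0.22,
    ("missing_field_detection", "standard"): 0.72,
    ("missing_field_detection", "handwritten"): 0.31,
}

def weakest_cells(matrix, threshold=0.5):
    """Cells below the threshold are candidates for product-level mitigation:
    exclude the input type, add human review, or set expectations."""
    return sorted(
        (rate, cap, cond) for (cap, cond), rate in matrix.items()
        if rate < threshold
    )

for rate, cap, cond in weakest_cells(matrix):
    print(f"{cap} / {cond}: {rate:.0%}")
```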

This matrix makes several things immediately visible:

  • Address parsing underperforms other extractions across the board
  • Handwritten documents are a near-total failure mode
  • Missing field detection — knowing when data is absent — is the weakest capability even under ideal conditions
  • Combining multiple degradation factors (e.g., handwritten + multi-language) will likely fail completely

Without the matrix, these boundaries stay invisible. The system handles 80% of traffic fine, while the remaining 20% generates a steady stream of silent errors that take months to surface.

A capability matrix is a snapshot, though. Provider updates, prompt tweaks, and retrieval pipeline changes can all shift boundaries silently — which is where canary prompts come in.

Canary Prompts: Catching Drift Between Deployments

Static probe suites catch issues at deploy time. But a capability that worked last week can silently degrade after what seemed like an unrelated change. Canary prompts extend your coverage into the gaps between deployments.

Canary prompts are fixed, unchanging inputs that you run against your model on a schedule — hourly, daily, or on every deployment. Each canary targets a specific capability boundary you have already mapped. When a canary's output drifts, something in your pipeline shifted — and you investigate before users notice.

Effective canaries share three properties:

  • They target capabilities near the failure boundary, where even small changes in model behavior cause a detectable shift.
  • They have deterministic expected outputs, so you can use exact-match or simple semantic comparison rather than LLM-as-judge evaluation.
  • They are cheap to run — your entire canary suite should execute in under a minute.
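Because canaries have deterministic expected outputs, the runner reduces to exact-match comparison against stored baselines. A minimal sketch, with a stub standing in for the real model call (and assuming deterministic sampling settings):

```python
def run_canaries(model, canaries):
    """Run each fixed canary input and compare against its stored baseline.
    Any mismatch is drift worth investigating before users notice."""
    drifted = []
    for name, (prompt, expected) in canaries.items():
        if model(prompt) != expected:
            drifted.append(name)
    return drifted

def model(prompt):
    """Stand-in for the real model call (assumption)."""
    return "FAIL" if "tricky address" in prompt else "PASS"

canaries = {
    "missing_field": ("invoice with one field deliberately blank", "PASS"),
    "tricky_address": ("scanned pdf, tricky address format", "FAIL"),
}
print(run_canaries(model, canaries))  # [] means no drift detected
```

Note that the stored baseline for `tricky_address` is the known failure output: a canary can guard a boundary from either side, and a fail-to-pass flip is just as much drift as a pass-to-fail one.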

A practical canary set for the extraction system above might include:

  • A scanned PDF with a known tricky address format (testing the 61% capability)
  • A document with a deliberately missing field (testing the 72% detection rate)
  • A bilingual invoice (testing the multi-language column)

If any of these canaries flip from pass to fail after a deployment, you have an early warning far more actionable than aggregate metrics.

The Probe-to-Regression Pipeline

Probes and regression tests serve different purposes, but they have a natural lifecycle relationship.

A new probe starts as exploration — you are looking for failures you did not know existed. When you find a failure and fix it (through prompt changes, retrieval improvements, or guardrails), that probe graduates into a regression test. It transitions from "can the model handle this?" to "does the model still handle this?"

This graduation matters because the two collections serve opposite goals. Probes should have low pass rates — if everything passes, you are not probing aggressively enough. Regression tests should have near-100% pass rates — if a previously fixed failure recurs, something broke.

Maintain both collections side by side:

  • Active probes: Cases where you know the model fails or is unreliable. These guide improvement priorities. Expect pass rates of 30–70%.
  • Regression suite: Cases where the model previously failed but now succeeds after changes. These guard against regressions. Target pass rates above 95%.

When active probes reach saturation — consistently high pass rates — either the issues are genuinely fixed or your probes have gone stale. Retire the easy wins into the regression suite, add harder variants, and keep pushing the boundary outward.
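The graduation decision can be automated against recent run history. A sketch under a hypothetical policy (pass rate at or above 95% over the last three runs), with results stored as 0/1 per run:

```python
def triage(probes, graduate_at=0.95, runs_required=3):
    """Split probes into (still active, ready to graduate to regression suite).
    Policy is hypothetical: graduate once the recent-window pass rate clears
    the bar and the probe has enough history."""
    active, graduating = [], []
    for name, results in probes.items():
        window = results[-runs_required:]
        rate = sum(window) / len(window)
        if rate >= graduate_at and len(results) >= runs_required:
            graduating.append(name)
        else:
            active.append(name)
    return active, graduating

probes = {
    "nested_address": [0, 0, 1, 1, 1],   # fixed: last three runs all pass
    "date_past_8k_tokens": [0, 1, 0],    # still unreliable: stays active
}
print(triage(probes))
```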

This lifecycle turns a one-time evaluation effort into a durable quality practice. Probes without graduation become stale. Regression suites without fresh probes become complacent. The two collections feed each other — and the system gets sharper with every iteration.

Grading Probes Without Ground Truth

Canaries and regression tests assume you know the correct answer. But what about probing new capabilities where no labeled data exists?

A common objection is that probing requires ground truth for every case. It doesn't — several grading strategies work well without labels:

  • Consistency checks: Run the same input multiple times. High variance signals uncertainty — you are sitting on a capability boundary.
  • Perturbation testing: Make small input changes that should not alter the output. If rephrasing a question or adding irrelevant context shifts the answer, the model is at its reliability limit.
  • Constraint verification: Check structural properties without judging content. Does the JSON parse? Are required fields present? Does the date fall in a plausible range? Mechanical checks catch a surprising number of failures.
  • Self-consistency probing: Ask the same question through different framings. Contradictions mark a boundary.

These approaches don't replace human evaluation for high-stakes decisions. But they let you probe at scale and focus expensive human review on the exact boundaries where automated checks flag ambiguity.

Pre-Deployment Probe Checklist

Before deploying a model to production or upgrading to a new version, run through this checklist:

  • Boundary probes: Test every known capability boundary from your matrix. Any degradation is a deployment blocker until investigated.
  • Canary suite: Run all canary prompts and compare to stored baselines. Flag any semantic changes — the question is not whether the new output is good, but whether it changed.
  • Format stability: If your system parses model output (JSON, XML, structured text), verify format consistency under the new version. Format changes are the most common silent regression in model upgrades.
  • Refusal behavior: Test edge cases for refusal. Model upgrades frequently shift refusal thresholds — a query that used to work might now be refused, or vice versa.
  • Latency and token usage: Measure token counts for the same inputs. Changes indicate behavioral shifts even when outputs look similar.
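The format-stability check from the list above can be made mechanical by comparing output structure while ignoring values. A sketch for JSON output, assuming the old and new versions were run on the same input:

```python
import json

def format_drift(old_output, new_output):
    """Compare the structure (keys and value types) of two model outputs,
    ignoring the values themselves. A non-empty result is a candidate
    silent format regression."""
    def shape(raw):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return None
        return {k: type(v).__name__ for k, v in data.items()}
    old, new = shape(old_output), shape(new_output)
    if old is None or new is None:
        return {"parse_failure": True}
    return {k: (old.get(k), new.get(k))
            for k in old.keys() | new.keys() if old.get(k) != new.get(k)}

old = '{"amount": 42.0, "currency": "USD"}'
new = '{"amount": "42.00", "currency": "USD"}'  # type changed: float -> str
print(format_drift(old, new))  # {'amount': ('float', 'str')}
```

The same outputs would pass an exact-value diff run on different inputs; comparing shapes instead catches the class of change that actually breaks downstream parsers.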

This checklist takes hours, not weeks. Compare that to the investigation time, emergency patches, and eroded trust that follow from discovering regressions in production.

From Probing to Product Decisions

The highest-leverage outcome of capability probing isn't catching bugs — it's informing product decisions with data instead of intuition.

Teams that ship reliable AI products treat capability boundaries as first-class product constraints. In practice, this looks like:

  • A living capability matrix updated with every model change, not just at deployment
  • Canary prompts running continuously in production, not just in CI
  • User-reported failures routed back into the probe suite within days, not quarters

Most importantly, they use probing results to make product decisions, not just engineering decisions. When the capability matrix shows 41% accuracy on handwritten inputs, the product decision might be to exclude that input type, add a human review step, or set user expectations with a confidence indicator. These decisions are impossible without knowing where the boundaries are.

The alternative — discovering boundaries through user complaints — is slower, more expensive, and erodes trust that is hard to rebuild. A single confident wrong answer makes users doubt every future output, even the correct ones.

Map the limitations first, then design accordingly. The boundaries don't disappear — but they transform from invisible risks into engineering constraints you can reason about, communicate, and build around. The teams shipping the most reliable AI products aren't the ones with the best models. They're the ones who know exactly where their models break — and have designed their systems around that knowledge.

Start small: take your last three production incidents, turn them into probes, and build from there.
