Agentic Data Pipelines: Offline Enrichment and Classification at Scale

· 9 min read
Tian Pan
Software Engineer

You have a batch job that classifies 10 million customer support tickets overnight. You swap the regex classifier for an LLM and the accuracy jumps from 61% to 89%. Then you ship it and discover: the job now costs 40x more, runs 12x slower, silently skips 3% of records when the model returns unparseable output, and your downstream analytics team is filing bugs because the label schema drifted without anyone noticing.

Agentic data pipelines break in ways that ETL engineers haven't seen before, and the fixes require a different mental model than either traditional batch processing or real-time LLM serving.

The Core Design Shift: Variable Compute Per Record

Traditional ETL has a convenient property: every record costs roughly the same to process. You can predict job duration, size your cluster, and set cost budgets by multiplying record count by per-record cost.

Agentic pipelines destroy this property. When an agent must reason about a record — fetching related context, deciding whether to call a tool, iterating on an extraction — the compute budget per record becomes a distribution, not a constant. Some records resolve in one model call. Others trigger three tool calls, a clarification loop, and a second extraction pass. The same pipeline processing the same schema can cost 2x more on one dataset than another, purely because the data is harder.

This has concrete infrastructure consequences:

  • Job sizing is probabilistic. You cannot estimate runtime without understanding the complexity distribution of your input data. A 10M-record job that completed in 4 hours last month may take 11 hours after you add a reasoning step for ambiguous cases.
  • Cost caps require per-record limits. Without explicit token budgets per record, a small subset of hard records can consume a disproportionate share of a job's cost. Set a max-token ceiling per record; fail gracefully when it is exceeded rather than letting a few outliers blow the budget.
  • Parallelism math changes. Batch throughput depends on GPU utilization across concurrently processed records. If 5% of records require 10x more compute, you either underutilize resources waiting on long-tail records or you let them stall the batch.
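The per-record cost cap above can be sketched as a hard token ceiling around the agent loop. This is a minimal illustration, not a real client: `call_model` and the result fields are stand-ins for whatever your model SDK returns.

```python
# Hypothetical sketch: enforce a per-record token budget so outlier
# records fail fast instead of consuming the job's cost headroom.
# `call_model` and its result shape are illustrative stand-ins.

MAX_TOKENS_PER_RECORD = 8_000

def process_record(record, call_model):
    """Run model/tool steps for one record under a hard token ceiling."""
    tokens_used = 0
    steps = []
    while True:
        result = call_model(record, steps)   # one model or tool step
        tokens_used += result["tokens"]
        if tokens_used > MAX_TOKENS_PER_RECORD:
            # Fail gracefully: emit a budget-exceeded marker, not a crash.
            return {"id": record["id"], "status": "budget_exceeded",
                    "tokens": tokens_used}
        steps.append(result)
        if result["done"]:
            return {"id": record["id"], "status": "ok",
                    "output": result["output"], "tokens": tokens_used}
```

The key design choice is that exceeding the ceiling produces a structured record, so budget-exceeded records show up in the job summary instead of disappearing.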

The practical fix is to segment records by estimated complexity before processing. A lightweight pre-classifier (rules-based or a small, cheap model) routes records into fast/slow lanes. The fast lane runs in parallel at maximum concurrency. The slow lane runs with tighter concurrency controls and explicit cost tracking. This pattern — not just throwing everything at maximum parallelism — is what separates agentic pipelines that run predictably from ones that produce bill shock.
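The fast/slow lane split can be as simple as a rules-based pre-classifier. The heuristic below is an illustrative assumption (ticket length and attachment count as complexity proxies); a small, cheap model could replace it.

```python
# Hypothetical sketch of complexity-based lane routing. The heuristic
# thresholds here are illustrative; a small model could replace them.

def estimate_complexity(record):
    """Cheap heuristic: long or attachment-heavy tickets go to the slow lane."""
    text = record.get("text", "")
    if len(text) > 2000 or record.get("attachments", 0) > 0:
        return "slow"
    return "fast"

def split_lanes(records):
    lanes = {"fast": [], "slow": []}
    for r in records:
        lanes[estimate_complexity(r)].append(r)
    return lanes
```

The fast lane then runs at maximum concurrency; the slow lane runs with tighter concurrency limits and explicit per-record cost tracking.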

Confidence Thresholds as Operational Contracts

In a traditional classifier, you have a model score and a threshold. Records above the threshold get a label; records below get flagged for review. This is familiar.

In an agentic pipeline, the question is harder: when the agent produces a classification, who owns the decision? The agent or a human reviewer?

The answer should not be "the agent always" or "always route to humans." The answer is a tiered contract you define at pipeline design time:

  • High-confidence, low-stakes records (product category tagging, sentiment classification): auto-commit without review
  • High-confidence, high-stakes records (fraud flags, compliance classifications, medical codes): commit but write to an audit log with mandatory sampling review
  • Low-confidence records regardless of stakes: route to a human review queue
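The tiered contract above reduces to a small routing function. The threshold value and the stakes classification below are placeholders, not recommendations; as the next section argues, thresholds must be derived empirically per task.

```python
# Hypothetical sketch of the tiered contract as a routing function.
# The threshold and the stakes classification are illustrative placeholders.

HIGH_STAKES_FIELDS = {"fraud_flag", "compliance_code", "medical_code"}
CONFIDENCE_THRESHOLD = 0.95   # derive empirically from a labeled holdout set

def route(field: str, confidence: float) -> str:
    if confidence < CONFIDENCE_THRESHOLD:
        return "review_queue"        # humans own the decision
    if field in HIGH_STAKES_FIELDS:
        return "commit_with_audit"   # agent decides, with sampled review
    return "auto_commit"             # agent owns it outright
```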

The threshold values are not magic numbers. They should be derived empirically from your specific model and data: run your pipeline on a labeled holdout set, plot precision/recall at different confidence cutoffs, and set thresholds where the tradeoff meets your downstream tolerance. A 95% confidence threshold that achieves 99.5% precision on your specific task is far more meaningful than a generic "use 0.8 confidence" rule.
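The empirical sweep described above can be sketched as follows: score a labeled holdout set, then report precision and coverage at each candidate cutoff so the threshold is chosen from data rather than a generic rule of thumb.

```python
# Illustrative sketch: sweep confidence cutoffs over a labeled holdout
# set and report precision/coverage at each candidate threshold.

def precision_at_cutoffs(preds, cutoffs):
    """preds: list of (confidence, predicted_label, true_label) tuples."""
    rows = []
    for c in cutoffs:
        kept = [(p, t) for conf, p, t in preds if conf >= c]
        if not kept:
            rows.append((c, None, 0.0))   # no records survive this cutoff
            continue
        precision = sum(p == t for p, t in kept) / len(kept)
        coverage = len(kept) / len(preds)
        rows.append((c, precision, coverage))
    return rows
```

Coverage matters as much as precision here: a cutoff that hits your precision target but routes half the records to human review may be operationally unaffordable.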

The organizational trap is treating confidence thresholds as a one-time calibration. Models drift. Data distribution shifts. A threshold set in January may be badly miscalibrated by July because the types of records entering the pipeline have changed. Build re-calibration into your operational cadence, triggered by: a new model version, a data source change, or a quarterly review.

One structural detail that teams often skip: route low-confidence records to a review queue that feeds back into your training set. Human-reviewed records at the margin of the confidence boundary are the highest-value training signal you can get. If you are running any form of fine-tuning or RLHF, this is where the data comes from.

Schema Design for Downstream Consumers

The structured output schema your pipeline produces is a distributed contract. Every downstream consumer — dashboards, APIs, data warehouses, other pipelines — depends on it. And because the schema is generated by an LLM call rather than deterministic code, it fails in ways that traditional schemas don't.

Principle 1: Enum types over free-form strings. If a field can only take a finite set of values, declare it as an enum in your output schema. LLMs will generate plausible-but-wrong free-form labels with confidence. They will rarely violate a well-specified enum. This is one of the highest-ROI schema design decisions you can make.
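A minimal sketch of the enum principle, assuming a hypothetical ticket-category field: declare the finite label set once, and validate model output against it before anything downstream sees the record.

```python
# Hypothetical sketch: declare finite label sets as enums and validate
# model output against them at the pipeline boundary.
from enum import Enum

class TicketCategory(str, Enum):
    BILLING = "billing"
    SHIPPING = "shipping"
    TECHNICAL = "technical"
    OTHER = "other"

def parse_category(raw: str) -> TicketCategory:
    try:
        return TicketCategory(raw.strip().lower())
    except ValueError:
        # A plausible-but-wrong free-form label fails loudly here
        # instead of drifting into the warehouse.
        raise ValueError(f"label {raw!r} violates the category enum")
```

Passing the same enum definition into the model's structured-output schema and into this validator keeps the two from drifting apart.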

Principle 2: Required fields should be operationally necessary. The temptation is to mark everything required. The operational problem is that required fields that the model frequently cannot populate create partial-failure chains: the record fails schema validation, gets skipped, and disappears silently. Mark a field required only if a null value in that field would break downstream consumers. Otherwise, optional with a nullable type and an explicit handling rule for nulls.

Principle 3: Treat schema changes as breaking API changes. Adding a required field, removing a field, or changing a field type are breaking changes for every downstream consumer. Version your schemas explicitly. Maintain a changelog. Give consumers a deprecation window before removing fields. This is standard API contract management, but teams often skip it for "internal" data pipelines — until a downstream dashboard breaks and nobody knows why.

Principle 4: Include provenance metadata. Every enriched record should carry fields that describe how it was produced: model version, prompt version, confidence score, processing timestamp, and the source record ID. This is not overhead — it is the audit trail that lets you debug a downstream anomaly, re-process affected records after a model update, and track which records were produced by which pipeline version.
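One way to make the provenance principle concrete is a small envelope attached to every enriched record. The field names below are illustrative, not a standard.

```python
# Hypothetical sketch of a provenance envelope carried by every
# enriched record; field names are illustrative.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Provenance:
    source_record_id: str
    model_version: str     # exact model snapshot used
    prompt_version: str    # versioned alongside the output schema
    confidence: float
    processed_at: str      # ISO-8601 timestamp

def enrich(label: str, prov: Provenance) -> dict:
    """Merge the label with its provenance into one flat output record."""
    return {"label": label, **asdict(prov)}
```

With this in place, "re-process every record produced by prompt v3 before the fix" becomes a simple filter instead of an archaeology project.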

Monitoring: Model Uncertainty vs. Data Ambiguity

The hardest monitoring problem in agentic data pipelines is this: a record returns low confidence. Is the model struggling with a clear record, or is the record genuinely ambiguous?

These two situations look identical from the outside — both produce low confidence scores — but they require different responses:

  • Model uncertainty on clear data is a signal that the model needs retraining, the prompt needs improvement, or the record contains an edge case the model hasn't seen. The fix is on your side.
  • Data ambiguity means the record is inherently unclear — multiple reasonable labels exist, the source data is contradictory, or the task specification doesn't cover this case. The fix may be impossible at the model layer.

If you conflate these, you will route genuinely ambiguous records into a human review queue where reviewers struggle to agree on an answer, generating low-quality labels that degrade your training data. Research on uncertainty quantification for LLMs confirms that current methods fail to distinguish these categories reliably when tested on deliberately ambiguous inputs.

Practical separation strategies:

  • Measure inter-annotator agreement on your review queue. If human reviewers disagree on a record at the same rate the model does, the record is genuinely ambiguous. Build this measurement into your review tooling.
  • Track calibration curves by data segment. If confidence scores are miscalibrated (high confidence, wrong answer) for a specific data segment, that points to model uncertainty on a specific distribution, not data ambiguity.
  • Use auxiliary signals. Cross-source inconsistency (the same entity labeled differently in two source systems), schema incompleteness (missing input fields), and outlier feature values are indicators of data quality issues, not model failures. Build these checks before the LLM step and tag records with data quality signals before processing.
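The inter-annotator agreement measurement above can be sketched as a simple pairwise agreement rate per record; if reviewers disagree roughly as often as the model is uncertain, the record is likely ambiguous data rather than a model failure.

```python
# Illustrative sketch: pairwise inter-annotator agreement for one record.
from itertools import combinations

def pairwise_agreement(labels):
    """labels: the annotator labels for one record, e.g. ['a', 'a', 'b']."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0   # a single annotator trivially agrees with itself
    return sum(x == y for x, y in pairs) / len(pairs)
```

In production you would likely use a chance-corrected statistic such as Cohen's or Fleiss' kappa, but the raw pairwise rate is often enough to flag the ambiguous segment.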

A monitoring dashboard for an agentic data pipeline should show, for each pipeline run: confidence score distribution, per-tier routing percentages (auto-commit / audit / review), schema validation failure rate, and — when labeled ground truth is available — precision and recall broken down by confidence bucket. When the confidence distribution shifts or the routing percentages change unexpectedly, you have an early signal before downstream consumers notice.

Partial Failure Recovery

Batch jobs fail. A network partition during the job, a model API rate limit, a schema validation failure on an unexpected record format — any of these can stop a job partway through. The question is whether you can resume from where you left off or whether you have to re-process everything.

Design for checkpointing from the start:

  • Write-once intermediate state. As records are processed, write results to an intermediate store (object storage, a database staging table) keyed by source record ID. Before processing any record, check whether a result already exists. This makes partial reruns safe by default.
  • Distinguish transient from permanent failures. A model API timeout is transient — retry with backoff. A schema validation failure on a specific record is permanent — log it, skip the record, and report it in the job summary without blocking the rest of the batch.
  • Report failure granularity. Job-level success/failure metrics hide per-record failure rates. A job that "completed successfully" but skipped 2% of records is not actually complete. Track and report the exact set of failed record IDs so that operators can inspect them and decide whether they require reprocessing.

The operational pattern is: every batch job produces three outputs — enriched records, a log of skipped records with failure reasons, and a summary with per-failure-type counts. Only when skipped record counts fall within defined tolerances is the job considered complete.
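The whole pattern fits in one driver loop. This is a minimal in-memory sketch: the `store` dict stands in for object storage or a staging table keyed by record ID, and the exception classes stand in for however your client distinguishes failure types.

```python
# Hypothetical sketch: write-once intermediate results keyed by record ID,
# transient retry vs. permanent skip, and the three job outputs
# (enriched records, skip log, failure-type summary). `store` stands in
# for object storage or a database staging table.
from collections import Counter

class TransientError(Exception): pass   # e.g. API timeout: retry with backoff
class PermanentError(Exception): pass   # e.g. schema violation: log and skip

def run_batch(records, process, store, max_retries=3):
    skipped, summary = [], Counter()
    for rec in records:
        if rec["id"] in store:          # checkpoint hit: already processed
            continue
        for attempt in range(max_retries):
            try:
                store[rec["id"]] = process(rec)
                break
            except TransientError:
                if attempt == max_retries - 1:
                    skipped.append((rec["id"], "transient_exhausted"))
                    summary["transient_exhausted"] += 1
            except PermanentError as e:
                skipped.append((rec["id"], str(e)))
                summary["permanent"] += 1
                break
    return store, skipped, summary
```

Because results are written once and checked before processing, rerunning the same job after a mid-batch failure only touches the records that have no result yet.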

Putting It Together

Agentic data pipelines are not simply "ETL with LLM calls substituted for rule-based transforms." The variable compute budget per record, the operational contract embedded in confidence thresholds, the distributed nature of structured output schemas, and the model-versus-data uncertainty distinction each require deliberate design choices that traditional batch engineering doesn't cover.

The teams that run these pipelines successfully treat them like distributed systems: with explicit failure modes, versioned contracts, observable state, and graceful degradation paths at every layer. The teams that don't end up with pipelines that work fine on Monday and silently drift into producing junk by Thursday — and nobody notices until a downstream dashboard goes wrong.
