
Token-Aware Logging: When Your Traces Cost More Than the Inference They Observe

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter spent six weeks chasing a memory pressure alert on their agent platform. The agents were cheap — a few cents a run. The traces were not. Their telemetry pipeline was eating three times the budget of the LLM calls it was instrumenting, and most of the spend went to fields nobody had read in months: full prompt bodies stored on every span, tool outputs duplicated across parent and child traces, and an LLM-judge evaluator that re-paid the inference bill on every captured trace.

This is the AI observability cost crisis in miniature. A 2026 industry write-up modeled a customer support bot with 10,000 conversations of five turns each — with each turn fanning out into roughly four model calls, that comes out to 200,000 LLM invocations, 400 million tokens, and roughly a million trace spans per day. Datadog users widely report observability bills jumping 40-200% after they instrument AI workloads on the same backend that handled their REST APIs. The pipeline is paying twice for the same tokens: once to generate them, once to remember them.
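To make the arithmetic concrete, here is the fan-out those totals imply. The per-turn call count, tokens per call, and spans per call below are inferred from the write-up's numbers, not quoted from it:

```python
# Back-of-envelope volumes implied by the write-up's totals. The per-turn
# call count, tokens per call, and spans per call are inferred, not quoted.
conversations_per_day = 10_000
turns_per_conversation = 5
calls_per_turn = 4        # implied: 200,000 invocations / 50,000 turns
tokens_per_call = 2_000   # implied: 400M tokens / 200,000 invocations
spans_per_call = 5        # implied: ~1M spans / 200,000 invocations

calls = conversations_per_day * turns_per_conversation * calls_per_turn
print(f"LLM invocations/day: {calls:,}")                    # 200,000
print(f"tokens/day:          {calls * tokens_per_call:,}")  # 400,000,000
print(f"trace spans/day:     {calls * spans_per_call:,}")   # 1,000,000
```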

The fix is not "log less." The fix is to treat observability for AI systems as a workload with its own unit economics, separate from the request-response telemetry traditional services emit. Traditional logging is structured fields you can compress and forget; AI logging is unbounded text bodies that re-enter the inference budget every time something reads them. That distinction is what "token-aware logging" means.

Why AI Telemetry Has a Different Cost Curve

A traditional API call emits a few hundred bytes of structured logs — request method, path, latency, status, a handful of attributes. A single agent turn emits a kilobyte of structured fields and ten to fifty kilobytes of text bodies: the assembled prompt, the retrieved documents, the tool arguments, the tool result, the model's reasoning trace, the assistant's reply, and often the same content again on the parent span because both the orchestration layer and the underlying call captured it independently.

The volume multiplier is well-documented. RAG pipelines and agent workflows produce ten to fifty times the telemetry of a comparable REST endpoint, and trace storage scales four to eight times higher than traditional request-response flows because each user turn fans out into intent classification, retrieval, ranking, generation, and validation — each of which is its own span carrying its own copy of the input.

Two patterns make this worse than it has to be. The first is double capture. Most agent frameworks instrument both the high-level "user turn" trace and the low-level provider client, and the same prompt ends up on both spans. The second is re-evaluation. Teams attach an LLM-as-judge evaluator to scan production traces for quality, and that evaluator runs its own inference on every trace it touches. One AI lead reported that their evaluation ran at ten times the cost of the agent workload itself. The accepted best-practice ratio is closer to 1:1, and most teams do not measure it because the evaluator runs in a separate cost center from the agent.

The practical consequence is that you cannot reason about AI observability cost the way you reason about a Datadog ingest line. You have to think of it as a second model deployment that happens to consume your first model's outputs as input.

The Trace as a Heap: What Actually Costs Money

Think of a trace as a heap allocator. Each field you attach to a span is a long-lived allocation. The structured ones — duration, model name, prompt token count, output token count, error code, retry count, latency percentiles — are pennies, indexed efficiently, and useful in aggregate. The text bodies — the full assembled prompt, the raw tool output, the retrieval results, the reasoning trace — are kilobytes apiece, almost never queried by humans, and the load-bearing reason your storage tier doubled.

A token-aware mental model sorts every field into one of three classes. Structured metadata is cheap and queryable; capture it on every trace. Text bodies are expensive and rarely useful in raw form; capture them deliberately. Re-derivable signals — sentiment scores, judge ratings, semantic similarity — are even more expensive than text because they cost tokens to produce, so they should be sampled, not exhaustive.
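A minimal sketch of what that three-class policy can look like in instrumentation code. The field names and the classification table are illustrative, not taken from any particular library:

```python
from enum import Enum

class FieldClass(Enum):
    STRUCTURED = "structured"   # cheap, queryable: capture on every trace
    TEXT_BODY = "text_body"     # kilobytes apiece: capture deliberately
    RE_DERIVED = "re_derived"   # costs tokens to produce: smallest sample

# Hypothetical classification table; the field names are illustrative.
FIELD_POLICY = {
    "duration_ms": FieldClass.STRUCTURED,
    "model_name": FieldClass.STRUCTURED,
    "prompt_tokens": FieldClass.STRUCTURED,
    "output_tokens": FieldClass.STRUCTURED,
    "error_code": FieldClass.STRUCTURED,
    "prompt_body": FieldClass.TEXT_BODY,
    "tool_output": FieldClass.TEXT_BODY,
    "retrieval_results": FieldClass.TEXT_BODY,
    "judge_score": FieldClass.RE_DERIVED,
    "sentiment": FieldClass.RE_DERIVED,
}

def should_capture(field: str, body_sampled: bool, judge_sampled: bool) -> bool:
    cls = FIELD_POLICY.get(field, FieldClass.TEXT_BODY)  # unknown fields treated as expensive
    if cls is FieldClass.STRUCTURED:
        return True              # every trace
    if cls is FieldClass.TEXT_BODY:
        return body_sampled      # deliberate, sampled capture
    return judge_sampled         # re-derived signals run on the thinnest sample
```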

The mistake teams make is treating all three classes the same. A typical instrumentation library captures the full prompt because it is "always there," and the team ships an alert on tail-latency P99 that needs only the duration field. The alert fires on the duration; the storage bill comes from the prompt. Once you look at attributable cost per field, the optimization becomes obvious: structured fields stay, bodies become opt-in, and re-derived signals run on a sample.

This is also why span size limits in the major observability platforms — typically 256KB per field — are not just a performance guardrail. They are a hint that the platform was not designed to store inference inputs at full fidelity. When your spans are getting truncated, that is the platform telling you that you are using it as cold storage for a workload that should have a different retention tier.

Prompt Fingerprints Instead of Full Prompts

The single highest-leverage move on a verbose AI telemetry pipeline is replacing full-prompt capture with prompt fingerprinting. Instead of storing the assembled 6,000-token prompt on every trace, you store a hash of the prompt, plus the variables that were interpolated into it, plus a pointer to the prompt template version. A short-lived sample of full prompts goes to a separate, cheaper bucket — say, one in a hundred at a daily quota — for cases where you actually need to see what was sent.
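A minimal sketch of the swap, assuming a prompt registry that supplies a template ID and version. The attribute names and the 1-in-100 rate are illustrative, and a production version would also enforce the daily quota this sketch omits:

```python
import hashlib
import json
import random

def prompt_attributes(template_id: str, template_version: str,
                      variables: dict, assembled_prompt: str) -> dict:
    """Span attributes for one LLM call: a fingerprint instead of the body."""
    attrs = {
        "prompt.sha256": hashlib.sha256(assembled_prompt.encode()).hexdigest(),
        "prompt.template_id": template_id,          # pointer into the prompt registry
        "prompt.template_version": template_version,
        "prompt.variables": json.dumps(variables),  # short and structured
    }
    # Rare full-fidelity sample routed to a separate, cheaper bucket.
    if random.random() < 0.01:
        attrs["prompt.body_sample"] = assembled_prompt
    return attrs
```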

The cost difference is dramatic. A SHA-256 hash plus a few interpolated values is under a kilobyte. The corresponding prompt body is typically twenty to a hundred times larger. For a workload at the scale of the customer support bot above, the swap takes prompt-related ingest from the dominant line item to a rounding error.

The reason fingerprinting is acceptable is that the prompt body is rarely the actual debug signal. When an agent fails, you almost always need three things: which template version was active, what variables were filled in, and what the model returned. The template version comes from your prompt registry; the variables are short and structured; the model output is the only piece you usually need in full. The prompt itself is reconstructable from the first two if you ever need it, and you can pull the rare full-fidelity sample for the cases where reconstruction is not enough.

Fingerprinting also unlocks de-duplication you cannot do with raw text. If 30% of your traffic hits the same prompt template with the same variables — a common pattern in tool-heavy agents that re-call retrieval with similar queries — your fingerprint counts collapse those into a single entry with a count. That tells you something useful (this template handles a third of your load) without storing 30% of your traffic in your trace store.
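The rollup itself can be as simple as a counter keyed on the fingerprint, sketched here with the same illustrative attributes as above:

```python
from collections import Counter

# Collapse repeated (template version, prompt hash) pairs into counts
# instead of storing each occurrence's text.
fingerprint_counts: Counter[tuple[str, str]] = Counter()

def record(template_version: str, prompt_sha256: str) -> None:
    fingerprint_counts[(template_version, prompt_sha256)] += 1

# Which templates carry the load? fingerprint_counts.most_common(10)
```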

Sampling That Knows About Outcomes

Head sampling — keeping a fixed fraction of incoming traces — is the wrong default for AI workloads. It treats success and failure the same, so you end up with thousands of boring traces and a thin sample of the failures you actually need to debug. The right default is outcome-aware tail sampling: capture 100% of error traces, 100% of low-quality outputs, and 100% of cost-anomaly traces, then sample successful traces at a low rate.

Concretely, an outcome-aware policy looks like this:

  • Always keep traces where the agent threw an exception, hit a tool error, or tripped a safety filter.
  • Always keep traces whose token cost exceeds two standard deviations above the workload's mean.
  • Always keep traces with a low judge or user-feedback score.
  • Sample the rest at 5-10%.

The decision has to happen at trace export, not at ingest, which is why this is called tail sampling — you wait until the trace is complete and the outcome is known before you decide whether to keep it. The OpenTelemetry Collector's tail sampling processor supports this, and most paid platforms now expose some equivalent. The savings show up on the storage bill, not the ingest bill, but for AI workloads storage usually dominates anyway.
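Sketched as an export-time decision function, assuming the completed trace exposes outcome fields like the ones below. The field names, the 5% default, and the 0.5 quality threshold are all illustrative:

```python
import random
from dataclasses import dataclass

@dataclass
class CompletedTrace:
    # Illustrative outcome fields; real traces expose these via your SDK.
    had_exception: bool
    tool_error: bool
    safety_filter_tripped: bool
    token_cost: float
    quality_score: float | None   # judge or user-feedback score, if any

def keep_trace(t: CompletedTrace, cost_mean: float, cost_std: float,
               success_rate: float = 0.05) -> bool:
    """Export-time keep/drop decision, made once the outcome is known."""
    if t.had_exception or t.tool_error or t.safety_filter_tripped:
        return True                                # 100% of failures
    if t.token_cost > cost_mean + 2 * cost_std:
        return True                                # 100% of cost anomalies
    if t.quality_score is not None and t.quality_score < 0.5:
        return True                                # 100% of low-quality outputs
    return random.random() < success_rate          # thin sample of the rest
```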

The same principle applies to LLM-judge evaluation. Running the judge on every captured trace is the pathology that produced the 10× cost ratio one team reported. Running the judge on a 1-5% stratified sample — stratified by user cohort, query category, and judge confidence — gives you the trend signal you actually want without paying full inference twice on every interaction. The teams that hold to a roughly 1:1 evaluation-to-workload cost ratio are the ones that treat the judge as a sampling problem, not an exhaustive scan.
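A minimal sketch of that stratified selection, assuming each trace record carries cohort and category labels (the keys and the 2% rate are placeholders):

```python
import random
from collections import defaultdict

def stratified_judge_sample(traces: list[dict],
                            keys: tuple[str, ...] = ("cohort", "category"),
                            rate: float = 0.02) -> list[dict]:
    """Select ~2% of traces for LLM-judge evaluation, with every
    (cohort, category) bucket represented by at least one trace."""
    buckets: dict[tuple, list[dict]] = defaultdict(list)
    for t in traces:
        buckets[tuple(t[k] for k in keys)].append(t)
    sample = []
    for bucket in buckets.values():
        n = max(1, int(len(bucket) * rate))
        sample.extend(random.sample(bucket, min(n, len(bucket))))
    return sample
```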

Retention Tiers and the COGS Mindset

Even after sampling and fingerprinting, you will have a real volume of trace data, and most of it loses value within a window measurable in days. A retention model that matches that reality is the last big lever.

The pattern that works is two-tier retention. Hot tier holds the last seven to thirty days at full fidelity — structured fields, sampled bodies, judge scores — for active debugging and alerting. Cold tier holds aggregated structured fields and prompt-fingerprint counts for ninety days or more, for cost analysis, drift monitoring, and capacity planning. The bodies do not move to cold tier; they expire when they leave hot. If a debugging session needs an old prompt body, you pay the small cost of an explicit replay or rerun rather than the standing cost of keeping every body warm forever.
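A sketch of the demotion step, assuming traces are flat dictionaries and reusing the illustrative body-field names from earlier. The tier windows are knobs, not recommendations:

```python
HOT_DAYS, COLD_DAYS = 14, 90   # tier windows are knobs, not recommendations

BODY_FIELDS = {"prompt_body", "tool_output", "retrieval_results", "reasoning_trace"}

def demote_to_cold(trace: dict) -> dict:
    """On hot-tier expiry, carry forward only structured fields and
    fingerprint counts; the bodies expire instead of migrating."""
    return {k: v for k, v in trace.items() if k not in BODY_FIELDS}
```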

This is the part that requires an organizational shift. Traditional observability is measured by retention duration — "we keep ninety days of logs" — because the per-record cost is small and the value of historical access is high. AI observability has to be measured by COGS contribution per active user or per agent run, because the per-record cost is large enough to show up on the P&L. A team that treats their telemetry pipeline as a fixed-cost utility will keep getting surprised by their bill; a team that gives it a unit-economics line will catch the regression the first month.

The practical effect of putting AI observability on a COGS line is that someone owns the cost. Without that ownership, the default outcome is the platform team adds instrumentation to debug an incident, the bill grows, nobody notices because it is bundled into the platform line, and three quarters later finance asks why observability spend doubled. With explicit attribution — telemetry cost per agent feature, per workload, per tenant — the team that ships the agent owns the trace bill, which is the only structure that consistently produces sane sampling defaults.
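A sketch of the unit-economics line itself; every price below is a placeholder, not any vendor's rate card:

```python
def telemetry_cost_per_run(ingest_gb: float, stored_gb_days: float,
                           judge_tokens: int, runs: int,
                           ingest_price: float = 0.10,    # $/GB ingested
                           storage_price: float = 0.02,   # $/GB-day retained
                           token_price: float = 2e-6) -> float:  # $/judge token
    """Attribute telemetry spend, including judge inference, to agent runs."""
    total = (ingest_gb * ingest_price
             + stored_gb_days * storage_price
             + judge_tokens * token_price)
    return total / runs
```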

When the Trace Is the Product

There is one important counter-pattern to all of this. Some products are their traces — agent platforms where the customer's interaction with the trace UI is the thing they paid for, AI coding tools that show step-by-step reasoning, evaluation platforms where the trace is the artifact under analysis. In those products, the trace cannot be aggressively sampled because the trace is the deliverable, not the metadata.

For those workloads the answer is not "log less." It is to recognize that the trace store has become a primary data store, not a telemetry sidecar, and to build it accordingly: indexed appropriately for the queries that drive the product, replicated according to the durability the product needs, and priced into the cost of the product the way you price any primary data store. The tools that give you a flat-rate "ingest everything" pricing model start hurting badly at this scale, which is why the more mature agent-platform companies are building their own trace stores rather than running on Datadog's general-purpose ingest.

If you are in this position, the shift is the same one that happened to log management ten years ago: the moment your "logs" became your "data," running them on a generic observability backend stopped making sense.

What to Build Next

If your AI telemetry pipeline has grown organically alongside an agent product, three concrete moves usually pay for themselves in the first month: replace prompt-body capture with prompt fingerprinting plus sampled full-fidelity capture; convert head sampling to outcome-aware tail sampling, with the judge running on a stratified sample rather than the full stream; and add a unit-economics line for telemetry cost per agent run, owned by the team that ships the agent.

The bigger shift is the mental model. Observability for AI systems is its own COGS line. The traces are not free, the bodies you capture are not free, and the LLM-judge that runs on every trace is paying the inference bill twice. Once a team internalizes that, the optimizations become obvious, and the question stops being "how do we afford this much logging" and starts being "what do we actually need to see."
