The SIEM Bill Your AI Feature Forgot to Include

10 min read
Tian Pan
Software Engineer

The math is simple and nobody did it. Pre-AI, a single user action — "summarize this ticket," "send this email" — produced one application log line. Post-AI, the same action emits a request log, an LLM call trace, a tool-invocation span for each tool the agent called, a retrieval span per chunk it read, a response log, and an eval log if you sample for offline scoring. The fan-out for one user click is now 30 to 50 records on the floor of your observability pipeline, and that's before retries, before sub-agents, before the planner-executor split that doubles everything again.

You shipped an AI feature in Q1. In Q2, your security director walks into a budget review with a Splunk renewal that's 4x higher than last cycle. Nobody on the AI team is in the room. The conversation that happens next — about who owns the cost, why the threat-detection rules stopped working, and whether legal hold on every conversation is actually mandatory — is a conversation you should have had at design time and didn't, because the cost didn't show up on the LLM invoice. It showed up downstream, in a tool the AI team has never logged into.

This post is about the second-order cost of AI features that nobody models: what happens to your audit-log volume, your SIEM ingestion, your threat-detection baselines, and your legal-hold storage when the agent ships. The first-order cost — the token bill — is the one everyone watches. It's also the one that's easy to forecast. The second-order cost is the one that quietly eats your security observability stack and arrives as a renewal surprise six months after launch.

The 30x telemetry multiplier nobody briefed security on

A traditional REST endpoint produces a small, well-shaped trace: an inbound request span, a couple of database calls, maybe an outbound HTTP call, and a response. A handful of spans per user action, call it three to five, is the steady-state baseline most platform teams have been operating against for a decade. Their SIEM correlation rules, their cost forecasts, their sampling defaults — all of it is calibrated to that order of magnitude.

An LLM call is not a database call. A single chat completion produces 8 to 15 spans once you instrument it properly: the gateway span, the model span, child spans for input formatting, output parsing, function-call extraction, content moderation, a token-usage event, and an eval-sample span if you're sampling for offline grading. A multi-step agent that loops through 5 reasoning steps before answering produces 40 to 75 spans for what the user perceives as one interaction. Industry analyses now put AI workloads at 10 to 50 times the telemetry volume of traditional services.

That multiplier compounds across three axes nobody budgets for. Retries: agents retry tool calls and model calls aggressively because reasoning failures are often transient — every retry is another full sub-trace. Sub-agents: the moment your top-level agent delegates to a planner or specialist agent, you have a tree where each node fans out to its own LLM and tool spans. Cardinality: prompts, model versions, tool names, and tenant IDs become high-cardinality span attributes, and metric tiers that were free at 10K series become overage line items at 200K.
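
None of this needs to be a surprise: the fan-out is forecastable before launch with a ten-line calculation. A minimal sketch, where every input is an illustrative assumption you would replace with your own traffic numbers:

```python
# Back-of-envelope ingest forecast for one agent feature.
# Every number is an illustrative assumption; swap in your own traffic.

daily_active_users = 20_000
agent_actions_per_user = 5      # "summarize this", "draft that", per day
reasoning_steps = 5             # loop iterations before the agent answers
spans_per_llm_call = 10         # gateway, model, parsing, moderation, ...
tool_spans_per_step = 2         # tool invocation + result handling
retry_overhead = 1.3            # assume 30% of calls retried end to end
avg_span_bytes = 2_000          # inlined prompts push this far higher

spans_per_action = reasoning_steps * (spans_per_llm_call + tool_spans_per_step)
daily_spans = (daily_active_users * agent_actions_per_user
               * spans_per_action * retry_overhead)
daily_gb = daily_spans * avg_span_bytes / 1e9

print(f"{spans_per_action} spans/action, {daily_spans:,.0f} spans/day, "
      f"{daily_gb:.1f} GB/day")
# 60 spans/action, 7,800,000 spans/day, 15.6 GB/day -- before sub-agents
```

Run the same math with three spans per action for the pre-AI endpoint and the 10-to-50x multiplier falls out directly.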

The result is that "we added an AI feature to checkout" lands in observability as a 4-to-8x increase in ingestion, and Datadog customers who instrumented LLM-call payloads have publicly reported observability bill jumps of 40 to 200 percent depending on how much of the prompt and completion they captured. The bill is not where the surprise stops. The threat-detection layer that runs on top of the ingestion stream is the part that breaks first.

Why your threat-detection rules silently stop working

SIEM correlation rules are statistical. They assume a baseline of "normal" activity per user, per service, per endpoint, and they fire when the rate or pattern deviates. Those baselines were trained against pre-AI traffic. Once an agent ships, every user action looks like a burst — multi-span, multi-tool, multi-retry — and rules tuned for "alert if a single user generates more than 500 events per minute" start firing every time someone uses the assistant.

The first thing that happens is rule fatigue. The security team gets paged on what is just an agent doing its job. The second thing is worse: to keep the noise down, somebody loosens the rule, and now the threshold that used to catch a credential-stuffing pattern is well above where credential-stuffing actually happens. The rule didn't get deleted. It got de-tuned into uselessness, and the team that did it didn't know they were degrading detection coverage because nobody told them the agent traffic shape was new.
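
One concrete way out is to rebaseline by traffic class instead of loosening the threshold. A minimal sketch, assuming each security event carries a source field that distinguishes agent-emitted activity from direct human actions (the field names and thresholds are illustrative):

```python
# Rebaseline per-user event rates by traffic class instead of de-tuning
# the rule. Assumes each security event is a dict carrying a "source"
# field ("human" vs "agent"); field names and thresholds are illustrative.
from collections import defaultdict

THRESHOLDS = {"human": 500, "agent": 25_000}   # events per user per minute

def rate_violations(events):
    counts = defaultdict(int)
    for e in events:
        counts[(e["user"], e["source"], e["minute"])] += 1
    return [
        {"user": u, "source": s, "minute": m, "count": n}
        for (u, s, m), n in counts.items()
        if n > THRESHOLDS.get(s, THRESHOLDS["human"])
    ]
```

The human threshold stays where it caught credential stuffing; agent traffic gets a baseline of its own instead of dragging that threshold up.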

The deeper problem is that the SIEM correlation engine itself has tier limits. Splunk Enterprise Security and most equivalents are priced per ingested GB per day, with security-tier ingest often running $250 to $400 per GB at list. When ingestion blows past the licensed tier, the engine starts dropping events to keep itself alive. Drops are silent by default. The events most likely to drop are the long-tail ones — the unusual patterns — which is exactly the population the threat-detection rules are trying to find. Your detection coverage degrades and your dashboards still look green.

The lesson is that turning on an AI feature is, from the SIEM's point of view, a workload migration as significant as moving from monolith to microservices was a decade ago — and it deserves the same kind of cross-functional design review. It rarely gets one because the AI team and the security team don't share an on-call rotation and the cost signal lives in a different P&L.

When every conversation becomes a record under legal hold

Pre-AI, "communication that must be retained" meant email, Slack, and recorded calls. Most SaaS products did not consider their application telemetry to be communication. Once an AI feature is in the product, that line moves: every user prompt and every model response is a conversation, often with a system that the user perceives as an interlocutor, and legal teams are rapidly defaulting to "yes, retain it under the same retention rules as customer messages."

The retention math is brutal. HIPAA mandates six years of retention on records that touch PHI. SOX and many financial regulations push to seven. GDPR doesn't fix a number but requires that retention be tied to a documented purpose — and "audit defense for a model's clinical recommendation" is a purpose that justifies multi-year retention even if the operational SIEM only needs 30 days. The same conversation now has to live in two places: a hot tier for security analytics, and a cold tier for legal defense, with append-only integrity and separate-system checksums so the logs can be defended in a deposition.
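
To put numbers on "brutal," here is the steady-state comparison between keeping everything hot and tiering it, under illustrative assumptions (the unit prices are placeholders, not vendor quotes):

```python
# Steady-state storage cost at year 7: everything hot vs. 30-day hot tier
# plus cold archive. Every number is an illustrative assumption.

gb_per_day = 20                  # assumed AI audit/conversation volume
hot_gb_month = 2.50              # assumed fully loaded hot-tier $/GB-month
cold_gb_month = 0.023            # assumed object-storage $/GB-month

resident_gb = gb_per_day * 365 * 7        # ~51,100 GB on hand at year 7

all_hot = resident_gb * hot_gb_month * 12
tiered = (gb_per_day * 30) * hot_gb_month * 12 \
       + (resident_gb - gb_per_day * 30) * cold_gb_month * 12

print(f"all hot: ${all_hot:,.0f}/yr   tiered: ${tiered:,.0f}/yr")
# all hot: $1,533,000/yr   tiered: $31,938/yr -- a roughly 50x gap
```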

The cost of that cold tier is not the surprise. Object storage is cheap. The surprise is the cost of getting data out of the SIEM and into the cold tier in a form that a legal team can actually use — which means structured, queryable, with provenance metadata, with PII handling that satisfies the strictest jurisdiction you operate in. Teams that didn't design for this end up paying the SIEM vendor to retain everything at hot-tier prices for years, because the export pipeline was never built and the renewal date arrived first.

A useful frame: treat retention as a per-tenant policy at design time. Some tenants — healthcare, finance, regulated EU — get the seven-year cold tier with full content. Most tenants get a 90-day operational tier with redacted content and a metadata-only archive. The default cannot be "retain everything everywhere forever," because the default is what you ship when the product team forgot to ask.
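
A minimal sketch of that policy table, with tier names, fields, and durations invented for illustration:

```python
# Per-tenant retention as a design-time policy table, not a global default.
# Tier names, fields, and durations are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class RetentionPolicy:
    hot_days: int         # operational tier in the SIEM
    cold_years: int       # append-only object-storage archive
    store_content: bool   # full prompts/completions vs. metadata only
    redact_pii: bool

POLICIES = {
    "regulated": RetentionPolicy(hot_days=30, cold_years=7,
                                 store_content=True, redact_pii=False),
    "standard":  RetentionPolicy(hot_days=90, cold_years=1,
                                 store_content=False, redact_pii=True),
}

def policy_for(tenant_tier: str) -> RetentionPolicy:
    # The fallback IS the product decision: metadata-only, not everything-forever.
    return POLICIES.get(tenant_tier, POLICIES["standard"])
```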

Design choices that compress the surface

The instinct after seeing the bill is to turn off telemetry. Don't. The instinct that works is to separate the security pipeline from the engineering pipeline, treat them as different products, and engineer each one against its own budget.

A small set of architectural moves does most of the work:

  • Structured logs over verbose traces for the security path. The security pipeline does not need 75 spans per agent run. It needs a small number of high-signal records: who acted, what tool was invoked, what the authorization decision was, what data class was touched, and the outcome. Engineering observability can keep the full trace; security gets a separate emitter that produces a flat record per privileged action (a minimal emitter sketch follows this list).
  • Sampling that preserves errors and privileged actions while sub-sampling the normal path. Head sampling at 1–5 percent for routine traffic combined with tail sampling that always retains anything that errored, anything that touched a privileged tool, and anything flagged by an eval sampler; a per-span sketch of this keep policy appears below. The OpenTelemetry community reports tail sampling alone gets teams 70–80 percent of the way to a sustainable budget, with the rest coming from filtering and attribute scrubbing.
  • Hot and cold tiers separated by retention policy, not by service. Thirty days hot in the SIEM for security analytics, seven years cold in object storage for compliance. The cold path is append-only with offline checksums; the hot path can be lossy and re-derivable.
  • OTel attribute scrubbing at the collector. The OpenTelemetry GenAI semantic conventions explicitly call out that prompts and completions can either be inlined as span attributes (cheap but verbose) or stored externally with a reference on the span (expensive infra, much smaller spans). For the security pipeline, externalize the content and keep references — the SIEM correlation engine doesn't need 8KB prompts in the index.
  • Per-tenant retention policy, not a global default. Build the export and redaction pipeline before legal asks for it. The cost of building it after is paying the SIEM hot-tier rate for two years of conversations because the cold-tier export wasn't there yet.
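
For the first of these moves, the security emitter can be deliberately boring. A minimal sketch, with field names invented for illustration:

```python
# One flat, high-signal record per privileged action, emitted alongside
# (not derived from) the full engineering trace. Field names are illustrative.
import json
import logging
import time

security_log = logging.getLogger("security.audit")

def emit_security_event(*, actor_id: str, tool: str, authz_decision: str,
                        data_class: str, outcome: str, trace_id: str) -> None:
    """Who acted, what tool, what the authorization decision was, what data
    class was touched, and the outcome. No prompts, no payloads: trace_id
    is the join key back into engineering observability."""
    security_log.info(json.dumps({
        "ts": time.time(),
        "actor_id": actor_id,
        "tool": tool,
        "authz_decision": authz_decision,   # "allow" / "deny"
        "data_class": data_class,           # e.g. "pii", "phi", "public"
        "outcome": outcome,                 # "ok" / "error" / "blocked"
        "trace_id": trace_id,
    }))
```

The SIEM correlation rules run against records shaped like this; the 75-span trace stays in the engineering backend, at engineering prices.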

The pattern across these moves is the same: design two independent telemetry pipelines and let them have different budgets. The engineering pipeline optimizes for "can the on-call engineer debug a production issue in five minutes." The security pipeline optimizes for "did a user trigger a policy violation, and can we prove it." They share the OTel collector, not the storage backend.
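
On the sampling move from the list above: trace-level tail sampling properly belongs in the OTel collector's tail_sampling processor, but the keep policy itself is easy to express. Here is a per-span approximation using the OpenTelemetry Python SDK, with the privileged-tool and eval-flag attribute names invented for illustration:

```python
# Per-span approximation of "keep errors, privileged actions, and a small
# sample of everything else." Attribute names are illustrative; real
# trace-level tail sampling belongs in the collector.
import random
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor
from opentelemetry.trace import StatusCode

class SecuritySampler(SpanProcessor):
    def __init__(self, inner: SpanProcessor, keep_ratio: float = 0.05):
        self.inner = inner            # e.g. a BatchSpanProcessor
        self.keep_ratio = keep_ratio  # routine-traffic sample rate

    def on_start(self, span, parent_context=None):
        self.inner.on_start(span, parent_context)

    def on_end(self, span: ReadableSpan):
        keep = (
            span.status.status_code is StatusCode.ERROR
            or span.attributes.get("tool.privileged")   # illustrative attr
            or span.attributes.get("eval.flagged")      # illustrative attr
            or random.random() < self.keep_ratio
        )
        if keep:
            self.inner.on_end(span)

    def shutdown(self):
        self.inner.shutdown()

    def force_flush(self, timeout_millis=30000):
        return self.inner.force_flush(timeout_millis)
```

A real deployment would buffer whole traces in the collector so a kept leaf span retains its siblings; the sketch shows only the keep-or-drop decision.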

The cross-functional review that has to happen before launch

The architectural realization is the easy part. The organizational realization is harder: shipping an AI feature has a downstream cost in your security observability stack that nobody models because it doesn't show up on the LLM invoice. The cost lands on a team that didn't get to participate in the design decision, and it lands quarters after the feature launched, in the form of a renewal that has to be negotiated under time pressure.

Three habits prevent this. First, the AI feature kickoff includes a security observability owner from the start, with a budget signoff on projected ingest volume, retention policy, and detection-rule rebaselining. Second, the launch checklist includes a 30-day post-ship telemetry review where actual ingestion is compared against the forecast, before the SIEM tier limit forces a renegotiation. Third, the cost dashboard the AI team watches includes a downstream-observability line, not just the LLM provider's invoice — the actual leading indicator of total cost-of-ownership.

Forecast it before launch. Sample aggressively. Tier by retention policy. And invite security to the design review while the design is still cheap to change. The teams that do this ship AI features with steady-state operating costs they can defend at renewal. The teams that don't end up explaining at the postmortem how a feature that paid for itself in token bills quietly became the most expensive line item in the security org's budget.
