Token Spend Is a Security Signal Your SOC Isn't Watching
The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.
That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.
The reason the signal gets dropped is organizational, not technical. FinOps owns the token dashboard because tokens are dollars. Security owns the incident response runbook because credentials are access. Neither team has made the other's pager ring. The token stream is a high-signal, low-false-positive abuse detector that happens to also be a cost control — and it's sitting on the wrong side of the org chart.
Why Token Spend Beats Traditional Abuse Signals
Traditional credential-abuse detection relies on signals that don't apply to LLM APIs. There's no interactive login, so no impossible-travel alert ever fires. There's no session token in a cookie, so there's no session-hijacking pattern to spot. The endpoint agent sees a normal outbound HTTPS connection to api.openai.com or api.anthropic.com, indistinguishable from legitimate traffic. DLP tools can't tell whether an outbound POST was instructed by the user or by an injected prompt in the context window. By the time the attack surfaces anywhere in conventional telemetry, the bill is already big.
Token spend is different in three ways that matter for detection.
First, it's bounded by economics. A legitimate user has a workload shape. A product has an average tokens-per-request. An agent has a plan-and-execute pattern that converges. Compromise breaks those shapes in distinctive ways: attackers reselling access burn inference continuously, prompt-injected agents emit oversized responses that carry exfiltrated context, and credential-stuffing pipelines probe multiple models in succession to fingerprint what the key can reach. None of those patterns look like the legitimate traffic on the same key.
Second, it's per-principal attributable. Unlike network telemetry, which aggregates at the VPC or host level, LLM API logs include the API key (or OAuth principal), the model, input token count, output token count, latency, and increasingly the tool-call shape. That's a natural primary key for a behavioral baseline. Per-key, per-session, per-tenant — you pick the aggregation and anomaly detection becomes a bread-and-butter statistics problem, not a novel research one.
Third, it arrives in near real time. The provider meters every request. Dashboards lag, but the log stream underneath does not. A CUSUM detector on token velocity or a straight standard-deviation gate on output token length will catch the anomaly inside the current billing hour, not the next billing cycle.
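Here's that CUSUM detector in miniature, in Python. The baseline mean, slack `k`, and alarm threshold `h` are illustrative starting points, not tuned values; in practice you'd learn the baseline from each key's own history.

```python
class CusumDetector:
    """One-sided CUSUM on token velocity for a single API key.

    Accumulates positive deviations from a baseline rate and alarms when
    the cumulative excess crosses a threshold: k absorbs normal noise,
    h decides how much sustained excess counts as an anomaly.
    """

    def __init__(self, baseline_mean: float, k: float, h: float):
        self.mean = baseline_mean  # tokens/minute learned from history
        self.k = k                 # slack: deviations under k are ignored
        self.h = h                 # alarm threshold on the cumulative sum
        self.s = 0.0               # running cumulative excess

    def update(self, tokens_this_minute: float) -> bool:
        """Feed one minute of token volume; return True on alarm."""
        self.s = max(0.0, self.s + (tokens_this_minute - self.mean - self.k))
        return self.s > self.h


# A key that normally does ~2,000 tokens/min starts burning 9,000+.
detector = CusumDetector(baseline_mean=2_000, k=500, h=20_000)
for minute, volume in enumerate([2_100, 1_900, 9_000, 9_400, 9_200, 9_500]):
    if detector.update(volume):
        print(f"minute {minute}: token-velocity alarm (cusum={detector.s:,.0f})")
```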
The Five Attack Patterns That Show Up Loudest
If you treat token spend as a detection signal, here are the shapes you'll actually see in the wild — and what each one usually means.
The 24/7 burn. Legitimate keys go quiet overnight and on weekends; resold access on underground markets does not. A key that shifts from bursty business-hours usage to a flat, high-utilization line with no diurnal pattern is the textbook signature of LLMjacking. Sysdig documented this against ten cloud-hosted LLM services after credentials were lifted from a vulnerable Laravel install. The victim's logs looked normal request-by-request; only the shape of the week gave the attack away.
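A cheap way to watch for that shape change is an off-hours ratio per key. A sketch, with a business-hours window that's an assumption you'd tune to the workload:

```python
from datetime import datetime

def offhours_ratio(hourly_tokens: dict[datetime, int]) -> float:
    """Ratio of off-hours to business-hours token volume for one key
    over a rolling window (hourly_tokens maps bucket start -> tokens).
    Business hours here are 08:00-20:00 Mon-Fri; adjust per workload.
    Bursty business-hours usage scores well under 1.0; resold access
    running around the clock pushes the ratio toward 1.0 and beyond.
    """
    business = sum(v for t, v in hourly_tokens.items()
                   if t.weekday() < 5 and 8 <= t.hour < 20)
    off = sum(hourly_tokens.values()) - business
    return off / max(business, 1)

# Alert when a key whose trailing baseline sat near 0.15 crosses, say, 0.8.
```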
The wrong-model spike. The Azure OpenAI abuse pattern Red Canary wrote up, and the OpenAI community report of a developer suddenly billed for gpt-4.5-preview and chatgpt-4o-latest models they'd never called, are variants of the same thing. An attacker harvests a key, enumerates accessible models, and starts spending on whichever one is most valuable on the resale market. If your baseline per-key model distribution is "97% gpt-4o-mini, 3% embeddings" and suddenly a tenth of the spend shifts to a flagship reasoning model, that's not a product change — that's someone else's product change.
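One simple score for that kind of mix shift is total variation distance between the key's baseline model-spend distribution and the current window's. The model names and threshold below are illustrative:

```python
def model_mix_shift(baseline: dict[str, float], current: dict[str, float]) -> float:
    """Total variation distance between two per-key model spend mixes.
    Both dicts map model name -> fraction of spend and sum to 1.0.
    Returns 0.0 for an identical mix, 1.0 for a disjoint one; a jump
    past ~0.1 on a previously stable key deserves a look.
    """
    models = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(m, 0.0) - current.get(m, 0.0)) for m in models)


baseline = {"gpt-4o-mini": 0.97, "text-embedding-3-small": 0.03}
current = {"gpt-4o-mini": 0.87, "text-embedding-3-small": 0.03, "o1": 0.10}
print(model_mix_shift(baseline, current))  # 0.1 -- a tenth of spend moved models
```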
Output-length exfiltration. Prompt injection attacks that read data out of a retrieval corpus or an agent's tool output almost always inflate the response length. If your chat assistant typically answers in 50–100 tokens and a single session starts emitting 3,000-token responses laced with what looks like structured data, that's an exfiltration tell. Straight output-token-length gating on the response log is a surprisingly effective first-pass filter. The payload you're looking for is big by construction.
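The gate itself is nearly a one-liner. Assuming you already track a rolling mean and standard deviation of output tokens per key, the multiplier is the only knob:

```python
def output_length_alarm(output_tokens: int, mean: float, std: float,
                        k: float = 6.0) -> bool:
    """Flag a single response whose output-token count is an outlier
    against the key's own history. k=6 is deliberately loose: a chat
    assistant averaging 80 +/- 40 tokens still flags a 3,000-token
    response without paging anyone over a verbose-but-normal answer.
    """
    return output_tokens > mean + k * max(std, 1.0)


print(output_length_alarm(3_000, mean=80, std=40))  # True: well past 320
print(output_length_alarm(240, mean=80, std=40))    # False: within the gate
```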
The retry storm. Infinite-loop failures — the November 2025 LangChain dual-agent incident that burned $47,000 over eleven days being the canonical example — look identical to low-and-slow abuse at first glance. Both show up as sustained high request rates with modest per-request sizes. The difference is that legitimate retry storms usually come with matching error rates in application logs; abuse doesn't. Correlating the inference stream with upstream errors separates "we have a bug" from "we have an attacker."
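A rough triage sketch, assuming you can join the key's request-rate deviation with the owning service's error rate from application logs or APM (that join is the real engineering work here):

```python
def classify_sustained_burn(request_rate_z: float, app_error_rate: float,
                            error_floor: float = 0.05) -> str:
    """Rough triage for a sustained high-rate burn on one key.
    request_rate_z: standard deviations above the key's baseline rate.
    app_error_rate: fraction of the owning service's upstream calls
    failing in the same window. Retry storms come with matching error
    rates; resold access doesn't.
    """
    if request_rate_z < 3.0:
        return "normal"
    if app_error_rate >= error_floor:
        return "suspected-retry-loop"  # route to the owning team first
    return "suspected-abuse"           # route to security first


print(classify_sustained_burn(request_rate_z=8.0, app_error_rate=0.31))  # retry loop
print(classify_sustained_burn(request_rate_z=8.0, app_error_rate=0.00))  # abuse
```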
The cross-tenant drift. In multi-tenant products, compromise of one tenant's API often manifests as that tenant's token pattern quietly starting to resemble another tenant's — same tool-call mix, same embedding-model ratio, same time-of-day profile. That's someone using Tenant A's credentials to do Tenant B's work, typically via a prompt-injection pivot through shared memory. It's invisible in the raw bill. It's obvious in the per-tenant embedding of usage behavior.
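A sketch of the idea using plain cosine similarity; the four-feature fingerprint is illustrative, and a production system would use richer vectors and proper clustering:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Per-tenant usage fingerprint, e.g.: [chat-token share, embedding-token
# share, tool-call share, off-hours ratio]. The drift signal is tenant
# A's current vector sitting closer to tenant B's baseline than its own.
tenant_a_baseline = [0.70, 0.25, 0.05, 0.10]
tenant_b_baseline = [0.20, 0.05, 0.75, 0.60]
tenant_a_now      = [0.25, 0.08, 0.67, 0.55]

self_sim  = cosine(tenant_a_now, tenant_a_baseline)
cross_sim = cosine(tenant_a_now, tenant_b_baseline)
if cross_sim > self_sim:
    print(f"tenant A drifting toward tenant B ({cross_sim:.2f} vs {self_sim:.2f})")
```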
The Org Seam Where the Signal Gets Dropped
The reason most of this goes undetected isn't that the detection is hard. It's that no one in the building has been asked to do it.
FinOps teams watch token spend because tokens are a budget line. Their tooling — Vantage, Amnic, in-house dashboards, provider consoles — is optimized for allocation, forecasting, and cost attribution. When a spike fires, the FinOps runbook is: find the offending service, notify the team, review for waste. Nowhere in that runbook is "create a security ticket." In practice, by the time FinOps has escalated an anomaly, 24–48 hours have passed and the attacker is long gone with the receipts.
Security teams, meanwhile, are largely blind to the signal. The inference stream usually doesn't reach the SIEM. When it does, it's truncated provider metadata — not the per-key token counts the detection actually needs. The SOC has no baseline for what normal looks like, no dashboard that correlates spend with IAM events, and no playbook that treats a token spike as an indicator of compromise rather than an indicator of a bad forecast. The FinOps Foundation's 2026 survey puts it starkly: 98% of enterprises are managing AI spend; only 44% have any financial guardrails at all, let alone security ones.
The wiring to close this seam is not expensive. You need three things: a structured log stream from every LLM call (model, principal, input/output tokens, tool name, latency, tenant, feature tag); a baseline-and-deviation detector that runs on that stream; and a bidirectional escalation path so that a billing-side anomaly pages security and a security-side account compromise freezes the spend. None of that is a six-month project. Most of it is a week of plumbing and a Sentinel, Splunk, or Elastic rule.
A Detection Pipeline You Can Stand Up This Quarter
The pieces of this pipeline are individually boring. The value is in having all three wired together.
Emit structured inference logs at the gateway. Put every LLM call through a proxy or gateway you control — LiteLLM, an Envoy filter, a dedicated microservice, whatever fits. Log, per request: timestamp, principal (API key ID, not the secret), tenant, feature tag, model, input tokens, output tokens, latency, tool names invoked, and a request-tree root ID. That's the schema the rest of the pipeline runs on, and it's the schema FinOps wants too, so the stream has two customers.
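Concretely, one row might look like the record below. The field names are a suggestion, not a standard:

```python
from dataclasses import dataclass, asdict
import json, time, uuid

@dataclass
class InferenceLogRecord:
    """One row per LLM call, emitted by the gateway. Both the detector
    and the FinOps pipeline consume this stream."""
    ts: float            # unix timestamp
    principal: str       # API key ID or OAuth principal -- never the secret
    tenant: str
    feature: str         # product feature tag, for cost attribution
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    tools: list[str]     # tool names invoked in this request, if any
    root_id: str         # request-tree root ID, for tracing agent fan-out

record = InferenceLogRecord(
    ts=time.time(), principal="key_7f3a", tenant="acme", feature="support-chat",
    model="gpt-4o-mini", input_tokens=412, output_tokens=88, latency_ms=640,
    tools=[], root_id=str(uuid.uuid4()),
)
print(json.dumps(asdict(record)))  # one JSON line per request into the stream
```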
Run a per-principal baseline detector on the stream. The math is not exotic. A rolling mean and standard deviation per principal on the rates that matter (requests per minute, tokens per request, output length, model distribution, tool-mix entropy) gets you 80% of the value. CUSUM and EWMA detectors catch slower drifts that standard-deviation gates miss. Seed the detector with two or three weeks of traffic per principal before alerting on it — the common failure mode is alerting during the warm-up window when variance is still stabilizing.
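A minimal per-principal EWMA baseline with that warm-up guard built in. `alpha`, `warmup`, and `k` are starting points to tune, and you'd keep one instance per (principal, rate) pair:

```python
import math

class PrincipalBaseline:
    """Rolling EWMA mean/variance for one rate on one principal,
    e.g. tokens-per-request on a single key, with a warm-up guard so
    the detector stays silent while the baseline stabilizes."""

    def __init__(self, alpha: float = 0.02, warmup: int = 5_000, k: float = 4.0):
        self.alpha = alpha    # EWMA decay; smaller means longer memory
        self.warmup = warmup  # observations absorbed before any alert
        self.k = k            # deviation gate, in standard deviations
        self.mean, self.var, self.n = 0.0, 0.0, 0

    def observe(self, x: float) -> bool:
        """Feed one observation; True means a post-warm-up deviation."""
        self.n += 1
        if self.n == 1:
            self.mean = x
            return False
        delta = x - self.mean
        alarm = self.n > self.warmup and abs(delta) > self.k * math.sqrt(self.var)
        # Update after the check so an anomaly doesn't poison its own baseline.
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return alarm
```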
Wire the alerts to both the SOC and FinOps. The same deviation should create a security incident and a cost alert. Which one is primary depends on the pattern: output-length anomalies and wrong-model spikes route to security first; sustained overnight burn and retry storms route to FinOps first, with a security CC. Over time the routing rules tune themselves based on which team is closing which classes of alert.
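The routing can start as a lookup table and earn complexity later; the anomaly-class names below are placeholders for whatever your detectors emit:

```python
# (primary owner, cc) per anomaly class; retune as closure data accumulates.
ROUTES = {
    "output-length":  ("security", "finops"),
    "wrong-model":    ("security", "finops"),
    "overnight-burn": ("finops",   "security"),
    "retry-storm":    ("finops",   "security"),
    "tenant-drift":   ("security", "finops"),
}

def route(anomaly_class: str) -> tuple[str, str]:
    """Return (primary, cc) for an anomaly; unknown classes page both."""
    return ROUTES.get(anomaly_class, ("security", "finops"))
```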
The last piece is the playbook on the security side. When the alert fires, the default response should be: rotate the key, snapshot the last hour of traffic on that key for forensics, notify the owning team, and freeze any downstream workflows that depend on the key until you've determined whether it was compromise, runaway agent, or legitimate surge. That last branch — "legitimate surge" — is the one that keeps the signal useful; a lot of organizations kill the detector the first time a legitimate product launch lights it up. Keep the detector, raise the thresholds, and keep the incident runbook cheap to execute.
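In code, the default branch is four calls in order: contain, preserve, notify, pause. Every helper below is a stub standing in for your own secrets manager, log store, pager, and orchestrator:

```python
# Stubs keep the sketch runnable; swap in real integrations.
def rotate_key(key_id): print(f"rotated {key_id}")
def snapshot_traffic(key_id, last): print(f"snapshotted last {last} of traffic on {key_id}")
def notify_owner(key_id): print(f"paged owning team for {key_id}")
def freeze_downstream(key_id): print(f"froze workflows depending on {key_id}")

def respond_to_key_alarm(key_id: str) -> None:
    """Default first response when a per-key detector fires; a human
    then picks the branch: compromise, runaway agent, or legit surge."""
    rotate_key(key_id)                   # cut off live abuse first
    snapshot_traffic(key_id, last="1h")  # preserve forensics before logs age out
    notify_owner(key_id)                 # owning team knows launch vs. breach
    freeze_downstream(key_id)            # pause dependent agents and workflows

respond_to_key_alarm("key_7f3a")
```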
What Changes When Security Actually Watches the Spend
The deepest shift here isn't technical. It's that token spend becomes joint-custody telemetry.
Once security has eyes on the stream, a bunch of adjacent abuse categories become detectable that otherwise wouldn't be. Credential leaks into public repos — the kind that show up in the "12,000 API keys found in public LLM training datasets" reporting — get detected at first use because the key's usage profile changes the instant someone else picks it up. Prompt-injection campaigns against agents register as output-length and tool-mix anomalies instead of as vague user complaints. Cross-tenant leakage in multi-tenant memory stores leaves per-tenant usage fingerprints that are readable with unsupervised clustering. The detection cost is near zero once the pipeline exists; it's marginal code on a stream you're already collecting.
The long-term version of this is that "token telemetry" becomes a named category in the security stack, the same way "network telemetry" and "identity telemetry" are today. It sits next to IAM logs and endpoint logs in the data lake, it has its own detections in the SIEM, and the org finally stops losing six-figure bills to attacks that leave a blinking trail in a dashboard nobody on the security team has access to. The infrastructure is already in place. The only thing missing is the permission to treat the bill as a breach detector.
Until that permission gets granted, the next public LLMjacking story will read the same way as the last one: the victim noticed at the end of the month, the attacker was gone by the middle of the first week, and the signal was sitting in a chart the whole time.
- https://www.sysdig.com/blog/llmjacking-stolen-cloud-credentials-used-in-new-ai-attack
- https://genai.owasp.org/llmrisk/llm102025-unbounded-consumption/
- https://promptguardrails.com/blog/llmjacking-stolen-credentials-ai-budget-attack
- https://www.upwind.io/glossary/llmjacking-what-is-it-and-why-is-it-a-concern
- https://redcanary.com/blog/threat-detection/azure-openai-abuse/
- https://www.monad.com/blog/detection-engineering-for-claude-code-part-1
- https://www.armosec.io/blog/how-to-detect-prompt-injection-in-production-ai-agent-workloads/
- https://www.vantage.sh/blog/finops-for-ai-token-costs
- https://community.openai.com/t/unexpected-67-token-usage-spike-with-models-i-never-use/1144995
- https://hackread.com/hackers-monetize-llmjacking-selling-stolen-ai-access/
