
Token Spend Is a Security Signal Your SOC Isn't Watching

11 min read
Tian Pan
Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.

The reason the signal gets dropped is organizational, not technical. FinOps owns the token dashboard because tokens are dollars. Security owns the incident response runbook because credentials are access. Neither team has made the other's pager ring. The token stream is a high-signal, low-false-positive abuse detector that happens to also be a cost control — and it's sitting on the wrong side of the org chart.

Why Token Spend Beats Traditional Abuse Signals

Traditional credential-abuse detection relies on signals that don't apply to LLM APIs. There's no login, so there's no impossible-travel login alert. There's no session token in a cookie, so there's no session hijacking pattern. The endpoint agent sees a normal outbound HTTPS connection to api.openai.com or api.anthropic.com — indistinguishable from legitimate traffic. DLP tools can't interrogate whether an outbound POST was instructed by the user or by an injected prompt in the context window. By the time the attack surfaces anywhere in conventional telemetry, the bill is already big.

Token spend is different in three ways that matter for detection.

First, it's bounded by economics. A legitimate user has a workload shape. A product has an average tokens-per-request. An agent has a plan-and-execute pattern that converges. Compromise breaks those shapes in distinctive ways: attackers reselling access burn inference continuously, prompt-injected agents emit oversized responses that carry exfiltrated context, and credential-stuffing pipelines probe multiple models in succession to fingerprint what the key can reach. None of those patterns look like the legitimate traffic on the same key.

Second, it's per-principal attributable. Unlike network telemetry, which aggregates at the VPC or host level, LLM API logs include the API key (or OAuth principal), the model, input token count, output token count, latency, and increasingly the tool-call shape. That's a natural primary key for a behavioral baseline. Per-key, per-session, per-tenant — you pick the aggregation and anomaly detection becomes a bread-and-butter statistics problem, not a novel research one.
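To make that concrete, here is a minimal sketch of building a per-key baseline from request logs. The record schema (`api_key`, `model`, `input_tokens`, `output_tokens`) is illustrative, not any provider's actual log format, and the records are toy data; the point is only that grouping by the principal and computing summary statistics is all the machinery a first baseline needs.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical log records: one dict per LLM API request.
# Field names are illustrative, not a real provider schema.
logs = [
    {"api_key": "key-a", "model": "gpt-4o-mini", "input_tokens": 120, "output_tokens": 80},
    {"api_key": "key-a", "model": "gpt-4o-mini", "input_tokens": 110, "output_tokens": 95},
    {"api_key": "key-a", "model": "gpt-4o-mini", "input_tokens": 130, "output_tokens": 70},
    {"api_key": "key-b", "model": "gpt-4o", "input_tokens": 900, "output_tokens": 2500},
]

def build_baselines(records):
    """Aggregate per-key output-token stats -- the 'natural primary key'."""
    by_key = defaultdict(list)
    for r in records:
        by_key[r["api_key"]].append(r["output_tokens"])
    return {
        key: {"mean": mean(v), "std": pstdev(v), "n": len(v)}
        for key, v in by_key.items()
    }

baselines = build_baselines(logs)
```

Swap the aggregation key for a session or tenant ID and the same three lines of grouping give you the other baselines mentioned above.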

Third, it arrives in near-real-time. The provider meters every request. Dashboards lag, but the log stream underneath does not. A CUSUM detector on token velocity or a straight standard-deviation gate on output-token-length will catch the anomaly inside the current billing hour, not the next billing cycle.
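A one-sided CUSUM on token velocity is a few lines of code. The sketch below assumes you already sample tokens-per-minute per key; the `target`, slack `k`, and threshold `h` values are illustrative and would need tuning against your own traffic.

```python
def cusum_alert(values, target, k, h):
    """One-sided CUSUM: flag a sustained upward drift in token velocity.

    values: tokens-per-minute samples for one key
    target: baseline rate; k: slack absorbing normal noise
    h: decision threshold. All numbers here are illustrative.
    """
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - k))  # accumulate only the excess
        if s > h:
            return i  # index of the sample that tripped the alarm
    return None

# Baseline ~1,000 tokens/min, then a stolen key starts burning hard.
stream = [980, 1020, 990, 1010, 4800, 5100, 5000]
print(cusum_alert(stream, target=1000, k=200, h=3000))  # → 4, the first spike sample
```

Because the statistic resets to zero under normal load, ordinary burstiness never accumulates; only a sustained shift crosses the threshold, which is exactly the shape of a compromised key.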

The Five Attack Patterns That Show Up Loudest

If you treat token spend as a detection signal, here are the shapes you'll actually see in the wild — and what each one usually means.

The 24/7 burn. Legitimate keys go quiet overnight and on weekends; resold access on underground markets does not. A key that shifts from bursty business-hours usage to a flat, high-utilization line with no diurnal pattern is the textbook signature of LLMjacking. Sysdig documented this against ten cloud-hosted LLM services after credentials were lifted from a vulnerable Laravel install. The victim's logs looked normal request-by-request; only the shape of the week gave the attack away.
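One cheap way to quantify "lost its diurnal shape" is the fraction of a key's daily requests that land outside business hours. This is a sketch under assumed conventions (24 hourly buckets, a 09:00–18:00 business window, timezone already normalized); the window and the alert threshold are yours to choose.

```python
def offhours_ratio(hourly_counts, business_hours=range(9, 18)):
    """Fraction of a day's requests landing outside business hours.

    hourly_counts: 24 request counts indexed by hour, timezone-adjusted.
    A key whose ratio climbs toward the uniform value (~0.63 for a
    9-to-18 window) has lost its diurnal shape -- the 24/7 burn.
    """
    total = sum(hourly_counts)
    if total == 0:
        return 0.0
    off = sum(c for h, c in enumerate(hourly_counts) if h not in business_hours)
    return off / total

bursty = [0] * 9 + [120, 200, 180, 90, 150, 210, 170, 140, 60] + [0] * 6
flat = [100] * 24  # the resold-access signature: no nights, no weekends

print(round(offhours_ratio(bursty), 2))  # → 0.0, healthy diurnal shape
print(round(offhours_ratio(flat), 2))    # → 0.62, no diurnal pattern
```

Computed over a rolling week per key, a jump in this ratio is precisely the "shape of the week" signal that gave the Sysdig case away.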

The wrong-model spike. The Azure OpenAI abuse pattern Red Canary wrote up, and the OpenAI community report of a developer suddenly billed for gpt-4.5-preview and chatgpt-4o-latest, models they had never called, are variants of the same thing. An attacker harvests a key, enumerates accessible models, and starts spending on whichever one is most valuable on the resale market. If your baseline per-key model distribution is "97% gpt-4o-mini, 3% embeddings" and suddenly a tenth of the spend shifts to a flagship reasoning model, that's not a product change — that's someone else's product change.
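Detecting this is a comparison of spend-share distributions per key. A minimal sketch, with an assumed 5-percentage-point growth threshold and illustrative model names:

```python
def model_shift(baseline_share, current_share, threshold=0.05):
    """Flag models whose share of a key's spend grew past threshold.

    Shares are fractions of per-key spend over comparable windows.
    The threshold is illustrative; tune it to your release cadence.
    """
    return [
        model
        for model, share in current_share.items()
        if share - baseline_share.get(model, 0.0) > threshold
    ]

baseline = {"gpt-4o-mini": 0.97, "text-embedding-3-small": 0.03}
current = {"gpt-4o-mini": 0.85, "text-embedding-3-small": 0.03, "o1": 0.12}

print(model_shift(baseline, current))  # → ['o1'], a model this key never used
```

A model absent from the baseline defaults to a zero share, so any meaningful spend on a never-before-seen model fires immediately, which is the enumerate-then-resell signature.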

Output-length exfiltration. Prompt injection attacks that read data out of a retrieval corpus or an agent's tool output almost always inflate the response length. If your chat assistant typically answers in 50–100 tokens and a single session starts emitting 3,000-token responses laced with what looks like structured data, that's an exfiltration tell. Straight output-token-length gating on the response log is a surprisingly effective first-pass filter. The payload you're looking for is big by construction.
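The gate itself is one inequality against the per-key baseline. A sketch, with an assumed 4-sigma cutoff and made-up baseline numbers:

```python
def length_gate(output_tokens, baseline_mean, baseline_std, n_sigma=4.0):
    """Flag a response far longer than the per-key baseline.

    n_sigma is illustrative; exfiltrated context is big by
    construction, so even a crude gate catches it.
    """
    return output_tokens > baseline_mean + n_sigma * baseline_std

# A chat assistant that typically answers in 50-100 tokens.
print(length_gate(90, baseline_mean=75, baseline_std=20))    # → False
print(length_gate(3000, baseline_mean=75, baseline_std=20))  # → True
```

In practice you would feed the gate's hits into a second-pass check for structured-data patterns in the response body, but the gate alone cuts the review volume to a trickle.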

The retry storm. Infinite-loop failures — the November 2025 LangChain dual-agent incident that burned $47,000 over eleven days being the canonical example — look identical to low-and-slow abuse at first glance. Both show up as sustained high request rates with modest per-request sizes. The difference is that legitimate retry storms usually come with matching error rates in application logs; abuse doesn't. Correlating the inference stream with upstream errors separates "we have a bug" from "we have an attacker."
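That correlation can be sketched as a two-signal classifier. Everything here is an assumption for illustration: the request rate is expressed as a multiple of the key's baseline, and the two thresholds are placeholders to tune against real incidents.

```python
def classify_storm(request_rate, error_rate, rate_spike=5.0, err_floor=0.2):
    """Separate 'we have a bug' from 'we have an attacker'.

    request_rate: current rate as a multiple of the key's baseline.
    error_rate: fraction of matching upstream requests that errored.
    A retry loop spikes requests and errors together; resold access
    spikes requests while the app's error rate stays flat.
    Thresholds are illustrative.
    """
    if request_rate < rate_spike:
        return "normal"
    return "retry-loop" if error_rate >= err_floor else "possible-abuse"

print(classify_storm(request_rate=8.0, error_rate=0.45))  # → retry-loop
print(classify_storm(request_rate=8.0, error_rate=0.01))  # → possible-abuse
print(classify_storm(request_rate=1.2, error_rate=0.01))  # → normal
```

The join key matters more than the math: the inference log and the application error log have to share a request or session identifier, or the correlation cannot be computed at all.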
