AI for SRE Log Analysis: The Tiered Architecture That Actually Works

· 9 min read
Tian Pan
Software Engineer

When teams first wire an LLM into their log pipeline, the demo is impressive. You paste a stack trace, and GPT-4 explains the root cause in plain English. So the natural next step is obvious: automate it. Send all your logs through the model and let it find the problems.

This is how you burn $125,000 a day and page your on-call engineers with hallucinations.

The math is simple and brutal. A mid-size production system generates around one billion log lines per day. At roughly 50 tokens per log entry, that's 50 billion tokens daily. Even at GPT-4o's discounted rate of $2.50 per million input tokens, you're looking at $125,000 per day before accounting for output costs, retries, or inference overhead. Real-time frontier model analysis of streaming logs is not an optimization problem — it's the wrong architecture.
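The back-of-envelope math is easy to reproduce. The constants below are the assumptions stated above (one billion lines per day, roughly 50 tokens per line, $2.50 per million input tokens), not measured figures:

```python
# Back-of-envelope cost of streaming every log line through a frontier model.
# Assumptions from the text: 1e9 lines/day, ~50 tokens/line, $2.50 / 1M input tokens.
LINES_PER_DAY = 1_000_000_000
TOKENS_PER_LINE = 50
PRICE_PER_MILLION_INPUT_TOKENS = 2.50

daily_tokens = LINES_PER_DAY * TOKENS_PER_LINE  # 50 billion tokens
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"${daily_cost:,.0f} per day")  # → $125,000 per day, input tokens only
```

Output tokens, retries, and rate-limit backoff all add to this floor, which is why the number in practice is a lower bound.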

But the harder failure isn't financial. It's operational. LLMs applied naively to logs produce alert noise at a scale that on-call teams cannot absorb. Studies consistently show that over 70% of SRE teams rank alert fatigue as a top-three concern. SOC teams filter out 67% of their alerts as false positives. When your AI-powered observability tool makes this worse, engineers don't complain about it — they route around it. The adoption curve for AI log analysis tools that increase alert volume is flat.

The architecture that actually works in production is tiered. It gates expensive LLM inference behind cheap, fast anomaly detection, so you pay frontier model prices only for the tiny fraction of logs that matter.

Why Logs Cannot Live in the Hot Path

Before designing the tiers, understand what "real-time" means for each SRE use case.

Alerting needs a decision in 100 milliseconds to 10 seconds. That window is served by metrics — CPU, memory, error rate, request latency. Logs have an ingestion latency of 3 to 10 minutes in most cloud platforms. By the time your logs are queryable, the spike that would have triggered an alert has already either resolved or escalated. Google's SRE book is explicit on this: saturation detection and alerting belong to metrics and structured time-series data, not log text.

Diagnosis happens after an alert fires, on a timeline measured in minutes. A human triage decision takes 5 to 15 minutes. LLM diagnosis latency of 5 to 60 seconds is completely acceptable here.

Postmortem analysis has no real-time constraint at all. Batch processing with a 50% API discount is appropriate.

Most teams that try to apply LLMs to logs are conflating these three windows. They try to use LLM analysis for alerting — which logs cannot support — and end up with an expensive, slow, noisy system that fails at the one thing that matters to on-call engineers: waking them up at the right moment, for the right reason.

The Three-Tier Architecture

Tier 1: Fast anomaly detection (milliseconds, negligible cost)

The first tier does one thing: identify which logs are worth looking at. It runs as a streaming component consuming from your centralized log sink — Kafka, S3, or CloudWatch — and passes only anomalies downstream.

The workhorse here is log clustering. The Drain algorithm, standard in production log parsing since 2017, groups log lines into templates by identifying fixed tokens and variable components. One billion raw log lines may collapse to 5,000 unique templates. From there, statistical methods — Isolation Forest, autoencoders, or simple baseline deviation — flag templates whose frequency or co-occurrence patterns fall outside normal bounds.

This step costs almost nothing. It runs on commodity compute. Its latency is measured in milliseconds per log entry. And it reduces the candidate set by two to three orders of magnitude before any expensive processing begins.

Hybrid approaches using Drain plus traditional ML classifiers achieve F1 scores between 86% and 95% on standard log anomaly benchmarks. You are not trading quality for cost here — the tradeoff simply does not exist at this layer.
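As a rough illustration of the tier 1 idea, here is a toy stand-in for Drain-style templating plus baseline-deviation flagging. Production systems would use a real Drain implementation (e.g. the drain3 library) and a trained detector such as Isolation Forest; the regex masking and the Poisson-style threshold below are simplifying assumptions, not the benchmark-grade pipeline:

```python
import re
from collections import Counter

def template_of(line: str) -> str:
    """Toy stand-in for Drain: mask obviously variable tokens
    (IPs, hex ids, numbers) so raw lines collapse into templates."""
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<IP>", line)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

def anomalous_templates(lines, baseline: Counter, sigma: float = 3.0):
    """Flag templates whose frequency in the current window deviates
    from a historical per-window baseline. A crude Poisson-style check;
    real deployments use Isolation Forest, autoencoders, or similar."""
    current = Counter(template_of(l) for l in lines)
    flagged = []
    for tpl, count in current.items():
        mean = baseline.get(tpl, 0)
        if count > mean + sigma * max(mean, 1) ** 0.5:
            flagged.append((tpl, count, mean))
    return flagged
```

The point of the sketch is the shape of the computation: milliseconds of string work per line, a frequency table per window, and a candidate set orders of magnitude smaller than the raw stream.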

Tier 2: Confidence scoring (seconds, minimal cost)

Not every anomaly is worth LLM attention. Tier 2 applies lightweight ML models — decision trees, k-nearest neighbor ensembles — to rank anomaly candidates by diagnostic confidence. It answers: is this anomaly significant enough to warrant a $0.01–0.10 LLM call?

This tier runs in seconds on small feature vectors derived from tier 1's clustering output. Its primary function is further reducing the population that reaches tier 3, and providing a confidence signal that helps triage the LLM's response.

At scale, this means a system that processes one billion log lines per day might send 50 to 200 anomaly clusters to the LLM daily. That's a daily LLM cost of $5 to $20, not $125,000.
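The gating logic can be sketched as a scored threshold. The features and hand-tuned weights below are illustrative assumptions; a real tier 2 would replace this function with a trained model (decision tree, k-NN ensemble) calibrated against on-call feedback:

```python
def tier2_confidence(anomaly: dict) -> float:
    """Hypothetical confidence score for a tier 1 anomaly cluster.
    Hand-weighted features stand in for a trained classifier."""
    score = 0.0
    score += min(anomaly["deviation_sigma"] / 10, 1.0) * 0.4       # distance from baseline
    score += (1.0 if anomaly["severity"] in ("ERROR", "FATAL") else 0.2) * 0.3
    score += min(anomaly["affected_services"] / 5, 1.0) * 0.2      # blast radius
    score += (1.0 if anomaly["near_deployment"] else 0.0) * 0.1    # change correlation
    return score

LLM_THRESHOLD = 0.6  # tune against false-positive feedback from on-call

def should_call_llm(anomaly: dict) -> bool:
    """The gate: only anomalies above threshold incur a tier 3 LLM call."""
    return tier2_confidence(anomaly) >= LLM_THRESHOLD
```

The threshold is the lever that trades LLM spend against recall, which is why it needs the per-tier precision instrumentation discussed later.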

Tier 3: LLM diagnosis with runbook grounding (5–60 seconds, targeted cost)

Only confirmed, high-confidence anomalies reach the LLM. When they do, the prompt structure determines whether you get actionable output or verbose text that nobody reads during an incident.

The key insight from production deployments is that raw log dumps in the prompt produce worse results than structured diagnostic context. The effective pattern:

  • Input: The anomaly signature from tier 1 (template pattern, deviation magnitude, affected services) plus structured metadata (service name, deployment version, recent changes)
  • Runbook injection: A JSON-formatted runbook indexed by service and failure type, retrieved via exact match or similarity search
  • Constrained output format: Three fields only — what happened, why it matters, what to do next

This is not summarization. It is structured decision support, and the prompt should make that explicit. "Summarize these logs" produces paragraphs. "Apply the connection timeout runbook to these symptoms and output: [cause, impact, remediation steps]" produces something an on-call engineer can act on in 30 seconds.
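A minimal sketch of that prompt assembly, assuming the anomaly signature and runbook arrive as dicts from the earlier tiers (field names here are illustrative, not a fixed schema):

```python
import json

def build_diagnosis_prompt(anomaly: dict, runbook: dict) -> str:
    """Assemble a tier 3 prompt: anomaly signature plus injected runbook,
    with a constrained three-field output instead of open-ended summarization."""
    return (
        "You are assisting an on-call SRE. Do not summarize.\n"
        "Anomaly signature:\n" + json.dumps(anomaly, indent=2) + "\n"
        "Runbook for this service and failure type:\n" + json.dumps(runbook, indent=2) + "\n"
        "Apply the runbook to these symptoms and output exactly three fields:\n"
        "cause: <one sentence>\n"
        "impact: <one sentence>\n"
        "remediation: <numbered steps>\n"
    )
```

The constrained-output instruction is doing the real work: it converts the model from a prose generator into a form-filler grounded in the runbook.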

Microsoft Azure's incident management system, deployed across 15-plus internal teams by early 2025, follows this pattern. One team measured a 38% reduction in mean time to resolution and 90% diagnostic accuracy. The accuracy comes from grounding — the LLM is not reasoning from scratch, it is applying known patterns to new evidence.

The False-Positive Problem and Why It Kills Adoption

Tiered architecture addresses cost. It does not automatically address false positives. These are distinct problems, and conflating them is what causes teams to declare "we tried AI for logs and it didn't work."

A common mistake is feeding the LLM every anomaly the first tier surfaces. Even with good clustering, tier 1 will surface anomalies that are unusual but benign — scheduled jobs, known spikes, configuration changes that look odd in aggregate. Without tier 2 confidence scoring, these reach the LLM, which produces a confident-sounding diagnosis of a non-problem. The on-call engineer investigates, finds nothing, and adds your AI alert to their mental ignore list.

The TEQ model (tiered ensemble with qualitative scoring) demonstrated a 54% reduction in false positives with a 95.1% detection rate. That combination — fewer false alarms, not fewer alerts overall — is what defines success for on-call adoption. Engineers will tolerate alert volume if precision is high. They will not tolerate low precision regardless of how impressive the LLM explanation sounds.

Tracking precision and recall separately for each tier is mandatory instrumentation. If tier 1 surface rates increase while tier 3 dismissal rates also increase, you have a tier 2 calibration problem. If engineers mark LLM diagnoses as unhelpful at a rate above 20%, you have a runbook coverage gap.

Integration Patterns

Most production deployments integrate the tiered system as a sidecar microservice rather than modifying the primary observability stack. The pattern:

  1. Primary log sink (Kafka or S3) fans out to two consumers: the existing indexer (Elasticsearch, Loki) and the anomaly detection pipeline.
  2. The anomaly detection pipeline writes confirmed anomalies to a secondary queue.
  3. A separate service consumes that queue, handles LLM API calls, and writes diagnostic output to an incident management system (PagerDuty, Opsgenie, or an internal ticketing system).

This decoupling means the AI analysis layer can fail independently without affecting the primary observability stack. It also enables separate scaling — tier 1 is always-on and horizontally scalable, tier 3 is bursty and can queue during API rate limit events without losing data.

For the runbook integration, the effective patterns are either a structured JSON store retrieved by exact-match service/failure-type keys, or a small vector index for semantic retrieval when the failure type doesn't map cleanly to a known runbook. Start with exact match — it is faster, cheaper, and more predictable.
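The exact-match store can be as simple as a dict keyed by service and failure type. This sketch is an assumption about shape, not a prescribed schema; the `None` return is the seam where a vector-index fallback would slot in:

```python
class RunbookStore:
    """Exact-match runbook lookup keyed by (service, failure_type).
    A semantic-retrieval fallback can be layered on top for misses."""

    def __init__(self):
        self._books: dict[tuple[str, str], dict] = {}

    def add(self, service: str, failure_type: str, runbook: dict) -> None:
        self._books[(service, failure_type)] = runbook

    def lookup(self, service: str, failure_type: str):
        # None signals a miss: count it toward the runbook hit-rate metric,
        # then optionally fall through to similarity search.
        return self._books.get((service, failure_type))
```

Exact match keeps retrieval deterministic, which matters during incidents: the same anomaly always pulls the same runbook.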

What to Instrument

Beyond standard LLM tracing (token counts, latency, cost per call), add these metrics specific to the tiered pipeline:

  • Tier 1 surface rate: What fraction of log volume becomes anomaly candidates. A sudden increase often means a deployment or traffic change, not a system problem.
  • Tier 2 pass rate: What fraction of tier 1 anomalies reach the LLM. Calibrate this against false-positive feedback from on-call engineers.
  • LLM diagnostic precision: Percentage of LLM-generated diagnoses marked as accurate by the engineer who handled the incident. Collect this at incident close, not asynchronously.
  • Runbook hit rate: What fraction of tier 3 calls successfully matched a runbook. Low hit rate means your runbook coverage is falling behind your service portfolio.
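The four metrics above reduce to ratios over a handful of counters. A minimal sketch, with counter names chosen here for illustration:

```python
from dataclasses import dataclass

@dataclass
class PipelineCounters:
    """Raw per-window counters; the derived ratios mirror the four metrics above."""
    log_lines: int = 0
    tier1_anomalies: int = 0
    tier3_calls: int = 0
    runbook_hits: int = 0
    diagnoses_rated: int = 0
    diagnoses_accurate: int = 0

    def tier1_surface_rate(self) -> float:
        return self.tier1_anomalies / max(self.log_lines, 1)

    def tier2_pass_rate(self) -> float:
        return self.tier3_calls / max(self.tier1_anomalies, 1)

    def runbook_hit_rate(self) -> float:
        return self.runbook_hits / max(self.tier3_calls, 1)

    def diagnostic_precision(self) -> float:
        return self.diagnoses_accurate / max(self.diagnoses_rated, 1)
```

Emitting these as time series per window is what makes threshold recalibration a routine operation rather than an archaeology project.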

These four metrics give you the feedback loops to improve each tier independently. IBM's production deployment across 1,376 cases showed that 60% of incidents saved 30 minutes or more once the pipeline was calibrated — but calibration required these feedback loops to be in place from week one.

The Common Failure Mode

Teams that implement this correctly and then gradually break it share a common pattern: they add more log sources to tier 1 without recalibrating tier 2, and they stop collecting diagnostic precision feedback from on-call engineers.

Without recalibration, tier 2's confidence thresholds go stale. New services have different log patterns, and the ensemble starts passing more low-quality candidates to the LLM. Without feedback, you don't notice until on-call engineers have already tuned out the AI diagnoses — at which point you have an expensive system that nobody uses.

The architecture is not a one-time implementation. It requires an operational discipline that looks more like model serving than traditional monitoring infrastructure: drift detection, periodic threshold recalibration, and continuous precision tracking tied to actual incident outcomes.

That operational investment is what separates teams that reduce MTTR with AI-assisted log analysis from teams that add another observability tool to their ignored list.
