Skip to main content

32 posts tagged with "sre"

View all tags

AI for SRE Log Analysis: The Tiered Architecture That Actually Works

· 9 min read
Tian Pan
Software Engineer

When teams first wire an LLM into their log pipeline, the demo is impressive. You paste a stack trace, and GPT-4 explains the root cause in plain English. So the natural next step is obvious: automate it. Send all your logs through the model and let it find the problems.

This is how you burn $125,000 a month and page your on-call engineers with hallucinations.

The math is simple and brutal. A mid-size production system generates around one billion log lines per day. At roughly 50 tokens per log entry, that's 50 billion tokens daily. Even at GPT-4o's discounted rate of $2.50 per million input tokens, you're looking at $125,000 per day before accounting for output costs, retries, or inference overhead. Real-time frontier model analysis of streaming logs is not an optimization problem — it's the wrong architecture.

The On-Call Runbook for AI Systems That Nobody Writes

· 10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.

AI-Assisted Incident Response: How LLMs Change the SRE Playbook Without Replacing It

· 11 min read
Tian Pan
Software Engineer

Here is the paradox that nobody in the AIOps vendor space is advertising: organizations that invested over $1M in AI tooling for incident response saw their operational toil rise to 30% of engineering time—up from 25%, the first increase in five years. Teams expected the automation to replace manual work. Instead, they got a new job: verifying what the AI said before acting on it. The old tasks didn't go away. A verification layer appeared on top.

This is not an argument against AI in incident response. The same data shows a 40% reduction in mean time to resolution when AI is integrated well, and some teams report cutting investigation time from two hours to under thirty minutes. The argument is more precise: the failure modes of AI copilots are qualitatively different from the failure modes of traditional SRE tooling, and most teams aren't set up to catch them.

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

· 11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

AI-Assisted Incident Response: Giving Your On-Call Agent a Runbook

· 9 min read
Tian Pan
Software Engineer

Operational toil in engineering organizations rose to 30% in 2025 — the first increase in five years — despite record investment in AI tooling. The reason is not that AI failed. The reason is that teams deployed AI agents without the same rigor they use for human on-call: no runbooks, no escalation paths, no blast-radius constraints. The agent could reason about logs, but nobody told it what it was allowed to do.

The gap between "AI that can diagnose" and "AI that can safely mitigate" is not a model capability problem. It is a systems engineering problem. And solving it requires the same discipline that SRE teams already apply to human operators: structured runbooks, tiered permissions, and mandatory escalation points.

AI in the SRE Loop: What Works, What Breaks, and Where to Draw the Line

· 12 min read
Tian Pan
Software Engineer

Most production incidents don't fail because of missing tools. They fail because the person holding the pager doesn't have enough context fast enough. An engineer wakes up at 3 AM to a wall of firing alerts, spends the first 20 minutes piecing together what actually broke, another 20 minutes deciding which runbook applies, and by the time they're executing the fix, the incident has been open for nearly an hour. The raw fix might take 5 minutes.

AI can compress that context-gathering window from 40 minutes to under 2. That's the genuine value on the table. But "LLM helps your oncall" is not one product decision — it's a stack of decisions, each with its own failure mode, and some of those failure modes have consequences that a customer service chatbot hallucination doesn't.

The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook

· 9 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.

This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software — check the logs, find the stack trace, roll back the deploy — break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found operational toil rose from 25% to 30% for the first time in five years, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.

SLOs for Non-Deterministic Systems: Defining Reliability When Every Response Is Different

· 8 min read
Tian Pan
Software Engineer

Your AI feature returns HTTP 200, completes in 180ms, and produces valid JSON. By every traditional SLI, the request succeeded. But the answer is wrong — a hallucinated product spec, a fabricated legal citation, a subtly incorrect calculation. Your monitoring is green. Your users are furious.

This is the fundamental disconnect that breaks SRE for AI systems. Traditional reliability engineering assumes a successful execution produces a correct result. Non-deterministic systems violate that assumption on every request. The same prompt, same context, same model version can produce a different — and differently wrong — answer each time.