The LLM Request Lifecycle Your try/catch Is Missing

· 10 min read
Tian Pan
Software Engineer

The most dangerous failure your LLM stack can produce returns HTTP 200. The JSON parses. Your schema validation passes. No exception is raised. And the response is completely wrong — wrong facts, wrong structure, truncated mid-sentence, or fabricated from whole cloth.

A single try/catch around an LLM API call handles the easy failures: rate limits, server errors, network timeouts. These are the visible failures. The invisible ones produce no exceptions: a model that hit its token limit and stopped mid-answer, an agent that made 21 extra LLM calls before finding the right parameter name, a validation retry that inflated your costs by 37%. They produce results.

The fix is not better error handling. It is modeling the LLM request lifecycle as an explicit state machine, where every state transition emits an observable span, and failure modes are first-class states rather than buried exception handlers.

What the LLM Request Lifecycle Actually Looks Like

Most production LLM requests are not a single API call. They are a pipeline of at least seven distinct phases, each with its own failure modes:

Routing — Before a token reaches a model, a routing layer selects which model and provider to use, based on task complexity, latency requirements, cost tier, and provider health. The routing decision is a distinct state that needs to be tracked: which rule fired, which model was selected, and why.

Request preparation — Context window budgeting, prompt template selection, tool schema injection, conversation history trimming. Failure here is often silent. Exceeding the context window outright produces a 400 error, which is non-retryable but at least visible. The worse case is a token count estimate that is off, so the request gets truncated without any indication.
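
To make the preparation phase observable, here is a minimal sketch of history trimming against a token budget. The names `estimate_tokens`, `PreparedRequest`, and `prepare` are hypothetical, and the four-characters-per-token heuristic is an assumption standing in for a real tokenizer. The point is only that the phase returns a record of what it dropped instead of trimming silently.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: assume about 4 characters per token."""
    return max(1, len(text) // 4)


@dataclass
class PreparedRequest:
    messages: list[str]      # history that fits the budget, oldest first
    dropped: int             # number of old messages trimmed away
    estimated_tokens: int    # token estimate for the kept messages


def prepare(history: list[str], context_limit: int, reserved_for_output: int) -> PreparedRequest:
    """Keep the most recent messages that fit in the input budget."""
    budget = context_limit - reserved_for_output
    if budget <= 0:
        raise ValueError("no input budget left after reserving output tokens")
    kept: list[str] = []
    used = 0
    for message in reversed(history):        # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break                            # everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()
    return PreparedRequest(kept, dropped=len(history) - len(kept), estimated_tokens=used)
```

Emitting `dropped` and `estimated_tokens` as span attributes is what turns a silent truncation into something you can query later.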

Primary generation — The actual LLM call. This phase contains its own sub-states: connection establishment, queue wait time (which determines time-to-first-token), and the decode phase (time-per-output-token). These are different problems requiring different optimizations, but they're invisible if you only measure total request duration.
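
One way to separate those sub-states is to wrap the streaming response and time the first chunk apart from the rest. This is a sketch under assumptions: `timed_stream` is a hypothetical helper, it treats each streamed chunk as roughly one unit of output, and it starts the clock when iteration begins, so it should be called right where the request is issued.

```python
import time
from typing import Callable, Iterable, Iterator


def timed_stream(
    chunks: Iterable[str],
    timings: dict,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield chunks unchanged while recording time-to-first-token and decode pace."""
    start = clock()
    first_at = None
    count = 0
    for chunk in chunks:
        now = clock()
        if first_at is None:
            first_at = now
            timings["ttft_s"] = first_at - start      # queue wait + prefill
        count += 1
        yield chunk
    end = clock()
    timings["total_s"] = end - start
    timings["chunks"] = count
    if first_at is not None and count > 1:
        # Average gap between chunks after the first one arrived.
        timings["per_chunk_s"] = (end - first_at) / (count - 1)
```

A slow `ttft_s` points at queueing or prompt size; a slow `per_chunk_s` points at decode throughput. Total duration alone cannot tell the two apart.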

Response parsing and schema validation — Raw LLM output is not trusted output. For structured pipelines, this means JSON parsing, schema validation, and semantic validation. This phase is itself a mini state machine: GENERATED → VALIDATED → (VALID | INVALID → REPAIR_PROMPT → GENERATED).
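
The mini state machine above can be sketched as a loop that records every state it passes through. Everything here is illustrative: `generate` is a hypothetical callable standing in for the model call, `validate` checks only JSON shape and required keys, and the repair prompt wording is an assumption.

```python
import json
from typing import Callable, Optional


def validate(raw: str, required_keys: set) -> Optional[str]:
    """Return an error message, or None if the output is acceptable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be an object"
    missing = sorted(required_keys - set(data))
    if missing:
        return f"missing keys: {missing}"
    return None


def generate_validated(generate: Callable[[str], str], prompt: str,
                       required_keys: set, max_repairs: int = 2):
    """Run GENERATED -> VALIDATED -> (VALID | INVALID -> REPAIR_PROMPT -> GENERATED)."""
    trace = []
    current_prompt = prompt
    for attempt in range(max_repairs + 1):
        raw = generate(current_prompt)
        trace.append("GENERATED")
        error = validate(raw, required_keys)
        trace.append("VALIDATED")
        if error is None:
            trace.append("VALID")
            return json.loads(raw), trace
        trace.append("INVALID")
        if attempt < max_repairs:            # only build a repair prompt if we will use it
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected ({error}). "
                "Return only corrected JSON."
            )
            trace.append("REPAIR_PROMPT")
    return None, trace                       # repairs exhausted; caller decides what is next
```

The returned `trace` is the part a bare try/catch never gives you: a request that succeeded on the second attempt looks different from one that succeeded on the first.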

Retry loop — On rate limits, server errors, or validation failures: exponential backoff with jitter, provider failover, or prompt mutation retry. Without explicit state tracking, every retry attempt is invisible — you only see the final success or failure.
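
A sketch of what visible retries can look like: exponential backoff with full jitter, where every attempt is recorded rather than swallowed. `RetryableError` is a hypothetical exception standing in for whatever your client raises on 429 or 5xx, and the default delays are placeholders, not recommendations.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable


class RetryableError(Exception):
    """Hypothetical marker for rate limits and server errors."""


@dataclass
class Attempt:
    number: int
    outcome: str      # "success" or "error: ..."
    slept_s: float    # backoff delay taken after this attempt


def call_with_backoff(call: Callable[[], str], max_attempts: int = 4,
                      base_s: float = 0.5, cap_s: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random):
    """Retry `call`, returning (result or None, list of recorded attempts)."""
    attempts = []
    for n in range(1, max_attempts + 1):
        try:
            result = call()
        except RetryableError as exc:
            if n == max_attempts:
                attempts.append(Attempt(n, f"error: {exc}", 0.0))
                break
            # Full jitter: a random delay between 0 and the capped exponential bound.
            delay = rng() * min(cap_s, base_s * 2 ** (n - 1))
            attempts.append(Attempt(n, f"error: {exc}", delay))
            sleep(delay)
        else:
            attempts.append(Attempt(n, "success", 0.0))
            return result, attempts
    return None, attempts    # exhausted; the caller moves on to fallback
```

Each `Attempt` maps naturally to one span, so a request that took three tries shows up as three spans instead of one slow call.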

Fallback routing — When primary generation fails: secondary provider → smaller model → cached response → rule-based degraded response. Each hop in the fallback chain is a distinct state with its own latency and cost profile.
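
The chain can be sketched as an ordered list of hops, each recorded by name and outcome. The hop names and handlers are hypothetical, and catching a broad `Exception` is a simplification for the sketch; a real chain would distinguish retryable from fatal errors.

```python
from typing import Callable, Optional

Hop = tuple[str, Callable[[str], Optional[str]]]


def run_fallback_chain(hops: list[Hop], prompt: str):
    """Try each hop in order; return (answer or None, list of (hop, outcome))."""
    visited = []
    for name, handler in hops:
        try:
            answer = handler(prompt)
        except Exception as exc:             # sketch only: treat any error as a failed hop
            visited.append((name, f"failed: {type(exc).__name__}"))
            continue
        if answer is None:                   # e.g. a cache miss
            visited.append((name, "miss"))
            continue
        visited.append((name, "served"))
        return answer, visited
    return None, visited                     # chain exhausted; escalate
```

Knowing which hop served the response matters for both cost accounting and quality: an answer from the rule-based hop should not be counted as a model success.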

Escalation — After retry and fallback exhaustion: human queue, task deferral, or fail-safe degraded response with user notification. In most codebases, escalation is a catch clause, not a named state.

When you treat this pipeline as a single opaque function call, you can measure its outcome — success or failure — but you cannot measure what happened inside.

The States Your Current Error Handling Misses

The failures that try/catch handles are the easy ones: 429 (rate limit), 500 (server error), 502/503/504 (gateway errors), network timeouts. These throw exceptions. They appear in error logs. They trigger alerts.

The failures that cause real production problems do not throw exceptions:

finish_reason: "length" — The model hit its token limit mid-answer and stopped. The HTTP response is 200. The JSON parses. Without explicitly checking finish_reason, you serve a truncated response as complete. This is not theoretical: a known bug in at least one major gateway library silently drops tool call responses when this condition occurs.
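
The check itself is small. This sketch assumes an OpenAI-style chat completion already parsed into a dict, where the field lives at `choices[0].finish_reason`; other providers name it differently (Anthropic's Messages API, for example, reports `stop_reason` with the value `"max_tokens"`). The lifecycle state names returned here are hypothetical.

```python
def classify_completion(response: dict) -> str:
    """Map a parsed chat completion to a lifecycle state instead of assuming success."""
    choice = response["choices"][0]
    reason = choice.get("finish_reason")
    if reason == "stop":
        return "COMPLETE"
    if reason == "length":
        return "TRUNCATED"        # hit the token limit: do not serve as a full answer
    if reason == "content_filter":
        return "FILTERED"
    if reason == "tool_calls":
        return "TOOL_CALL"
    return "UNKNOWN"              # missing or unrecognized: treat as suspect, not as success


truncated = {"choices": [{"finish_reason": "length", "message": {"content": "The answer is"}}]}
state = classify_completion(truncated)
```

Treating `TRUNCATED` as its own state lets you decide what to do with it (continue the generation, retry with a larger limit, or fail) rather than discovering it in a user complaint.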

Silent retry inflation in agents — A documented case study involving an AI travel agent showed the system making 49 LLM calls for a task requiring approximately 28. Every call returned 200. The root cause was a parameter name mismatch — the agent tried camelCase, lowercase, then underscore variants in sequence, each failing at the semantic layer. The agent eventually produced a correct itinerary. There was no error. At 1,000 daily runs, the compounding cost of those 21 extra calls per task amounts to roughly $9,271 in avoidable annual spend.
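
The arithmetic behind that figure is worth making explicit. The case study's per-call price is not given here, so this sketch works backwards: it computes the yearly count of wasted calls and the average cost per call that the $9,271 figure would imply. The 365-day year is an assumption, and the implied price is a derived number, not one reported by the source.

```python
def extra_calls_per_year(calls_made: int, calls_needed: int,
                         runs_per_day: int, days: int = 365) -> int:
    """Wasted calls per year, given calls per task and daily task volume."""
    return (calls_made - calls_needed) * runs_per_day * days


wasted_calls = extra_calls_per_year(calls_made=49, calls_needed=28, runs_per_day=1_000)
# 21 extra calls x 1,000 runs x 365 days = 7,665,000 wasted calls per year

implied_cost_per_call = 9_271 / wasted_calls
# roughly $0.0012 per call, if the annual figure is taken at face value
```

A fraction of a cent per call is invisible on any single request, which is why the waste only surfaces when calls per task is tracked as its own metric.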

Validation semantic drift — A response passes JSON schema validation but the values are wrong. A confidence_score field contains 0.99 — not computed, hallucinated. The field exists, the type is correct, the value is wrong. This requires semantic validation as a distinct lifecycle state. JSON schema validation is necessary but not sufficient.
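
What a semantic check looks like depends entirely on the pipeline, so treat this as one hypothetical shape. It assumes the response has already passed schema validation, that it carries an `evidence_ids` list alongside `confidence_score` (an invented field for illustration), and that the set of source IDs actually retrieved for the request is known.

```python
def semantic_issues(answer: dict, known_source_ids: set) -> list:
    """Return a list of semantic problems; an empty list means no issue was found."""
    issues = []
    score = answer["confidence_score"]
    cited = answer.get("evidence_ids", [])

    # A citation the pipeline never retrieved is a sign of fabrication.
    unknown = sorted(set(cited) - known_source_ids)
    if unknown:
        issues.append(f"cites unknown sources: {unknown}")

    # High confidence with nothing behind it is schema-valid but unsupported.
    if score >= 0.9 and not cited:
        issues.append("high confidence with no supporting evidence")
    return issues


problems = semantic_issues({"confidence_score": 0.99, "evidence_ids": []}, {"doc-1", "doc-2"})
```

A non-empty result moves the request into an explicit invalid state, the same way a schema failure would, instead of letting the hallucinated score flow downstream.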

Hallucinated success — Air Canada's chatbot invented a bereavement fare refund policy that did not exist. The API returned 200. The response was grammatically correct and structured consistently with real policies. A tribunal later held Air Canada responsible for its chatbot's output. OpenAI's Whisper transcription tool, used in medical settings, was found to fabricate phrases in approximately 1% of samples — nearly 40% of which were clinically harmful. These are not infrastructure failures. They are semantic failures dressed as successes.

Provider degradation without failure — Anthropic's Claude API saw its error rate climb from 3.2% in June 2025 to 11.7% in September 2025. Systems that measured total uptime saw this only as a reliability event. Systems that tracked only the requests that outright failed saw even less. Systems with multi-provider routing saw effective availability of 99.7% across the same period, because circuit breaker state told them when to stop sending traffic to a degrading endpoint before each individual request failed.
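
A circuit breaker of the kind described can be sketched as a rolling window of recent outcomes per provider. This is a simplified illustration: the class, its parameters, and its defaults are assumptions, and the half-open behavior is reduced to "reset and allow traffic again after a cooldown".

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    """Open when the error rate over the last `window` calls crosses a threshold."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.5,
                 cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.results = deque(maxlen=window)   # True = success, False = failure
        self.max_error_rate = max_error_rate
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None

    @property
    def state(self) -> str:
        return "CLOSED" if self.opened_at is None else "OPEN"

    def allow(self) -> bool:
        """Should the router send traffic to this provider right now?"""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: start over with a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False

    def record(self, ok: bool) -> None:
        self.results.append(ok)
        window_full = len(self.results) == self.results.maxlen
        error_rate = self.results.count(False) / len(self.results)
        if window_full and error_rate >= self.max_error_rate:
            self.opened_at = self.clock()
```

The threshold is the design choice that matters. To react to a drift like the one above, it would have to sit well below 50%, with a window large enough that a few unlucky calls do not trip it.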

Modeling the Lifecycle as a State Machine

An explicit state machine for LLM requests looks like this:

IDLE → ROUTING → PREPARING → CALLING_PRIMARY
    ↓ success: VALIDATING
        ↓ valid: COMPLETE
        ↓ invalid: REPAIR_RETRY → CALLING_PRIMARY
    ↓ 429/5xx: BACKING_OFF → CALLING_PRIMARY (max N)
        ↓ exhausted: CALLING_FALLBACK
            ↓ success: VALIDATING → COMPLETE
            ↓ exhausted: CALLING_CACHE
                ↓ hit: DEGRADED_RESPONSE
                ↓ miss: ESCALATING → HUMAN_QUEUE
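
In code, the diagram reduces to a transition table plus a small driver that records a span for every transition. This is a sketch, not a definitive implementation: the state names come from the diagram, but the event names `start`, `routed`, `prepared`, `retry`, and `queued` are invented here for the arrows the diagram leaves unlabeled, and spans are plain dicts rather than calls into a tracing library.

```python
import time

# (current state, event) -> next state
TRANSITIONS = {
    ("IDLE", "start"): "ROUTING",
    ("ROUTING", "routed"): "PREPARING",
    ("PREPARING", "prepared"): "CALLING_PRIMARY",
    ("CALLING_PRIMARY", "success"): "VALIDATING",
    ("CALLING_PRIMARY", "429/5xx"): "BACKING_OFF",
    ("BACKING_OFF", "retry"): "CALLING_PRIMARY",
    ("BACKING_OFF", "exhausted"): "CALLING_FALLBACK",
    ("VALIDATING", "valid"): "COMPLETE",
    ("VALIDATING", "invalid"): "REPAIR_RETRY",
    ("REPAIR_RETRY", "retry"): "CALLING_PRIMARY",
    ("CALLING_FALLBACK", "success"): "VALIDATING",
    ("CALLING_FALLBACK", "exhausted"): "CALLING_CACHE",
    ("CALLING_CACHE", "hit"): "DEGRADED_RESPONSE",
    ("CALLING_CACHE", "miss"): "ESCALATING",
    ("ESCALATING", "queued"): "HUMAN_QUEUE",
}


class RequestLifecycle:
    """Tracks one request through the lifecycle and records every transition."""

    def __init__(self, clock=time.monotonic):
        self.state = "IDLE"
        self.spans = []
        self._clock = clock

    def fire(self, event: str) -> str:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {event!r} from {self.state}")
        target = TRANSITIONS[key]
        self.spans.append(
            {"from": self.state, "event": event, "to": target, "at": self._clock()}
        )
        self.state = target
        return target


# A request whose primary call was rate limited and served by the fallback:
lifecycle = RequestLifecycle()
for event in ["start", "routed", "prepared", "429/5xx", "exhausted", "success", "valid"]:
    lifecycle.fire(event)
```

The request above ends in COMPLETE, exactly like one that succeeded on the first call. The difference is that its seven recorded spans say how it got there, and an illegal transition raises instead of passing silently.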