
The LLM Request Lifecycle Your try/catch Is Missing

· 10 min read
Tian Pan
Software Engineer

The most dangerous failure your LLM stack can produce returns HTTP 200. The JSON parses. Your schema validation passes. No exception is raised. And the response is completely wrong — wrong facts, wrong structure, truncated mid-sentence, or fabricated from whole cloth.

A single try/catch around an LLM API call handles the easy failures: rate limits, server errors, network timeouts. These are the visible failures. The invisible ones — a model that hit its token limit and stopped mid-answer, an agent that looped 21 extra tool calls before finding the right parameter name, a validation retry that inflated your costs by 37% — produce no exceptions. They produce results.

The fix is not better error handling. It is modeling the LLM request lifecycle as an explicit state machine, where every state transition emits an observable span, and failure modes are first-class states rather than buried exception handlers.

What the LLM Request Lifecycle Actually Looks Like

Most production LLM requests are not a single API call. They are a pipeline of at least seven distinct phases, each with its own failure modes:

Routing — Before a token reaches a model, a routing layer selects which model and provider to use, based on task complexity, latency requirements, cost tier, and provider health. The routing decision is a distinct state that needs to be tracked: which rule fired, which model was selected, and why.

Request preparation — Context window budgeting, prompt template selection, tool schema injection, conversation history trimming. Failure here ranges from loud to silent: exceeding the context window produces a non-retryable 400 error, and worse, if the token count estimate is off, the request gets truncated without any indication.
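Token budgeting can be made explicit rather than implicit. A minimal sketch, using a deliberately crude characters-per-token heuristic (swap in a real tokenizer such as tiktoken in production); the function names and the 4-chars-per-token assumption are illustrative:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 chars per token); replace with a real
    # tokenizer (e.g. tiktoken) for production budgeting.
    return max(1, len(text) // 4)

def trim_history(messages, system_prompt, max_context_tokens, reserve_for_output):
    """Drop the oldest turns until the request fits the context budget.

    Returns the trimmed message list with the newest turns preserved,
    making the trimming decision visible instead of silent.
    """
    budget = (max_context_tokens
              - reserve_for_output
              - estimate_tokens(system_prompt))
    kept, used = [], 0
    # Walk newest-first so the most recent turns survive trimming.
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

The point is not the heuristic; it is that trimming becomes an explicit, loggable decision instead of a provider-side surprise.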

Primary generation — The actual LLM call. This phase contains its own sub-states: connection establishment, queue wait time (which determines time-to-first-token), and the decode phase (time-per-output-token). These are different problems requiring different optimizations, but they're invisible if you only measure total request duration.

Response parsing and schema validation — Raw LLM output is not trusted output. For structured pipelines, this means JSON parsing, schema validation, and semantic validation. This phase is itself a mini state machine: GENERATED → VALIDATED → (VALID | INVALID → REPAIR_PROMPT → GENERATED).
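That mini state machine can be written directly as a validate-and-repair loop. A minimal sketch, assuming a call_model function you supply and a flat set of required keys standing in for a full JSON schema; all names here are illustrative:

```python
import json

def parse_and_validate(raw: str, required_keys: set):
    """Parse raw model output and run schema-level checks.

    Returns (ok, payload_or_error_message).
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    missing = required_keys - payload.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, payload

def generate_with_repair(call_model, required_keys, max_repairs=2):
    """GENERATED -> VALIDATED -> (VALID | INVALID -> REPAIR_PROMPT -> GENERATED)."""
    repair_suffix = ""
    for attempt in range(max_repairs + 1):
        raw = call_model(repair_suffix)                       # GENERATED
        ok, result = parse_and_validate(raw, required_keys)   # VALIDATED
        if ok:
            return result                                     # VALID
        # INVALID: feed the validation error back as a repair prompt.
        repair_suffix = (f"\nYour last reply failed validation: {result}. "
                         "Return only valid JSON.")
    raise ValueError(f"validation failed after {max_repairs} repairs: {result}")
```

The key property: the validation error travels back into the next prompt, and the number of repair cycles is bounded and countable rather than hidden inside a generic retry.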

Retry loop — On rate limits, server errors, or validation failures: exponential backoff with jitter, provider failover, or prompt mutation retry. Without explicit state tracking, every retry attempt is invisible — you only see the final success or failure.
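A bounded retry loop with full-jitter exponential backoff fits in a few lines. The base, cap, and attempt limit below are illustrative defaults, not recommendations:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, retryable_exceptions, max_attempts: int = 4):
    """Retry fn on the given exceptions; the budget is explicit and bounded."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable_exceptions:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: a named transition, not a silent loop
            time.sleep(backoff_delay(attempt))
```

In the state-machine framing that comes later in this article, each sleep is time spent in a BACKING_OFF state, and exhaustion is the transition to fallback.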

Fallback routing — When primary generation fails: secondary provider → smaller model → cached response → rule-based degraded response. Each hop in the fallback chain is a distinct state with its own latency and cost profile.

Escalation — After retry and fallback exhaustion: human queue, task deferral, or fail-safe degraded response with user notification. In most codebases, escalation is a catch clause, not a named state.

When you treat this pipeline as a single opaque function call, you can measure its outcome — success or failure — but you cannot measure what happened inside.

The States Your Current Error Handling Misses

The failures that try/catch handles are the easy ones: 429 (rate limit), 500 (server error), 502/503/504 (gateway errors), network timeouts. These throw exceptions. They appear in error logs. They trigger alerts.

The failures that cause real production problems do not throw exceptions:

finish_reason: "length" — The model hit its token limit mid-answer and stopped. The HTTP response is 200. The JSON parses. Without explicitly checking finish_reason, you serve a truncated response as complete. This is not theoretical: a known bug in at least one major gateway library silently drops tool call responses when this condition occurs.
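Guarding against this takes one explicit check on every response. A minimal sketch, assuming an OpenAI-style response shape ({"choices": [{"finish_reason": ...}]}); adjust the accessors for your provider's SDK:

```python
class TruncatedResponse(Exception):
    """A 200-OK response that was cut off mid-answer."""

def check_finish_reason(response: dict) -> str:
    """Reject silently truncated responses before they reach the user."""
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        # The model ran out of output tokens mid-answer: treat this as a
        # failure state, even though the HTTP status was 200.
        raise TruncatedResponse("output hit the token limit; response is incomplete")
    return choice["message"]["content"]
```

The exception turns an invisible failure into one your existing error handling, logging, and alerting can actually see.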

Silent retry inflation in agents — A documented case study involving an AI travel agent showed the system making 49 LLM calls for a task requiring approximately 28. Every call returned 200. The root cause was a parameter name mismatch — the agent tried camelCase, lowercase, then underscore variants in sequence, each failing at the semantic layer. The agent eventually produced a correct itinerary. There was no error. At 1,000 daily runs, the compounding cost of those 21 extra calls per task amounts to roughly $9,271 in avoidable annual spend.

Validation semantic drift — A response passes JSON schema validation but the values are wrong. A confidence_score field contains 0.99 — not computed, hallucinated. The field exists, the type is correct, the value is wrong. This requires semantic validation as a distinct lifecycle state. JSON schema validation is necessary but not sufficient.
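Semantic validation can start as simply as a list of post-schema checks. The field name and thresholds below are illustrative, not a standard:

```python
def semantic_checks(payload: dict) -> list:
    """Checks that run after JSON schema validation has already passed.

    Returns a list of problems; empty means semantically plausible.
    """
    problems = []
    score = payload.get("confidence_score")
    if isinstance(score, (int, float)):
        if not 0.0 <= score <= 1.0:
            problems.append("confidence_score out of range")
        elif score >= 0.99:
            # Suspiciously certain values are a common hallucination
            # signature; flag them for review instead of trusting them.
            problems.append("confidence_score suspiciously high")
    return problems
```

Each check encodes a domain expectation the schema cannot express, which is exactly why this deserves its own lifecycle state.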

Hallucinated success — Air Canada's chatbot invented a bereavement fare refund policy that did not exist. The API returned 200. The response was grammatically correct and structured consistently with real policies. A tribunal later held Air Canada responsible for its chatbot's output. OpenAI's Whisper transcription tool, used in medical settings, was found to fabricate phrases in approximately 1% of samples — nearly 40% of which were clinically harmful. These are not infrastructure failures. They are semantic failures dressed as successes.

Provider degradation without failure — Anthropic's Claude API saw its error rate climb from 3.2% in June 2025 to 11.7% in September 2025. Systems that measured total uptime saw this as a reliability event. Systems that measured only the requests that actually failed saw even less. Systems with multi-provider routing saw effective availability of 99.7% across the same period — because they knew, via circuit breaker state, when to stop sending traffic to a degrading endpoint before each individual request failed.

Modeling the Lifecycle as a State Machine

An explicit state machine for LLM requests looks like this:

IDLE → ROUTING → PREPARING → CALLING_PRIMARY

CALLING_PRIMARY  → VALIDATING                        on success
CALLING_PRIMARY  → BACKING_OFF                       on 429/5xx
BACKING_OFF      → CALLING_PRIMARY                   retry (max N attempts)
BACKING_OFF      → CALLING_FALLBACK                  retries exhausted

VALIDATING       → COMPLETE                          valid
VALIDATING       → REPAIR_RETRY → CALLING_PRIMARY    invalid

CALLING_FALLBACK → VALIDATING → COMPLETE             on success
CALLING_FALLBACK → CALLING_CACHE                     fallbacks exhausted

CALLING_CACHE    → DEGRADED_RESPONSE                 cache hit
CALLING_CACHE    → ESCALATING → HUMAN_QUEUE          cache miss
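The transition table itself can be plain data, which is what makes illegal transitions unreachable. A minimal sketch; the class and exception names are mine, not from any framework:

```python
# Legal transitions as data: anything not listed cannot happen by construction.
TRANSITIONS = {
    "IDLE": {"ROUTING"},
    "ROUTING": {"PREPARING"},
    "PREPARING": {"CALLING_PRIMARY"},
    "CALLING_PRIMARY": {"VALIDATING", "BACKING_OFF"},
    "BACKING_OFF": {"CALLING_PRIMARY", "CALLING_FALLBACK"},
    "VALIDATING": {"COMPLETE", "REPAIR_RETRY"},
    "REPAIR_RETRY": {"CALLING_PRIMARY"},
    "CALLING_FALLBACK": {"VALIDATING", "CALLING_CACHE"},
    "CALLING_CACHE": {"DEGRADED_RESPONSE", "ESCALATING"},
    "ESCALATING": {"HUMAN_QUEUE"},
}

class IllegalTransition(Exception):
    pass

class RequestFSM:
    def __init__(self):
        self.state = "IDLE"
        self.history = ["IDLE"]  # the audit trail a span per transition would record

    def to(self, next_state: str) -> None:
        if next_state not in TRANSITIONS.get(self.state, set()):
            raise IllegalTransition(f"{self.state} -> {next_state}")
        self.state = next_state
        self.history.append(next_state)
```

In a real system, `to()` is also where you would end the previous state's span and start the next one, so the history and the trace never disagree.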

This model has several important properties that a try/catch approach does not:

Illegal transitions are prevented. You cannot go from VALIDATING back to ROUTING. You cannot go from CALLING_CACHE to CALLING_PRIMARY. Making legal transitions explicit makes illegal ones impossible to reach accidentally.

Every state is named. ESCALATING is not a catch block — it is a first-class state with its own span, its own metrics, and its own alert threshold. The difference matters: you cannot alert on a catch block frequency without instrumentation; you can set a Prometheus counter on a named state.

State persistence becomes straightforward. Checkpointing a state machine means recording the current state plus the accumulated context. If an agent is in CALLING_FALLBACK when the server restarts, it can resume from that state rather than restarting the entire request. Frameworks like LangGraph implement this: checkpointing to Redis or Postgres after every node, so agents survive restarts without re-doing completed work.
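A checkpoint is just the current state plus the accumulated context, serialized after every transition. A minimal sketch with a plain dict standing in for Redis or Postgres; swap in a real client in production:

```python
import json

def checkpoint(state: str, context: dict, store: dict, request_id: str) -> None:
    """Persist state + context after every transition; store is any
    key-value backend (a dict here, Redis/Postgres in production)."""
    store[request_id] = json.dumps({"state": state, "context": context})

def resume(store: dict, request_id: str):
    """Rehydrate a request after a restart; returns (state, context) or None."""
    raw = store.get(request_id)
    if raw is None:
        return None
    data = json.loads(raw)
    return data["state"], data["context"]
```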

The retry budget is bounded and visible. A state machine has an explicit maximum retry count for each state. Retry exhaustion is a named transition to CALLING_FALLBACK or ESCALATING, not an unbounded loop that terminates when an exception propagates far enough up the call stack.

Applied to SQL generation tasks in a research benchmark, an FSM-based orchestration approach improved GPT-3.5 task success rates from 50.7% to 63.7% while reducing token costs by 5x — because explicit states enabled targeted recovery strategies at each failure point rather than a generic retry of the entire task.

Instrumenting Each State Transition with OpenTelemetry

The OpenTelemetry GenAI Semantic Conventions, launched in 2024 and now natively supported by Datadog, Honeycomb, and New Relic, define four span types for LLM operations: inference, embeddings, retrieval, and execute_tool. Each should be a child span of a root trace that covers the full request lifecycle.

The standard attributes include gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons, and gen_ai.request.model. The finish_reasons attribute is particularly important: tracking "length" as a distinct finish reason, not just "stop", surfaces the truncation problem described earlier.

What the standard does not yet define are the attributes for retry and fallback state. These require custom attributes:

  • llm.retry.attempt — the current retry number (0 = first attempt)
  • llm.retry.reason — rate_limit | server_error | validation_failure | timeout
  • llm.fallback.level — 0 (primary), 1 (first fallback), 2 (cache), 3 (degraded)
  • llm.fallback.trigger — what caused the fallback transition
  • llm.circuit_breaker.state — CLOSED | OPEN | HALF_OPEN
  • llm.validation.attempt — number of validate-repair-retry cycles
  • llm.validation.error — the specific schema validation error message
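A small helper can keep these custom attributes consistent across a codebase. This sketch only builds the attribute map; in real code you would attach each entry to the current OpenTelemetry span with span.set_attribute(key, value). The helper name is illustrative:

```python
def retry_span_attributes(attempt: int, reason: str,
                          fallback_level: int, trigger=None) -> dict:
    """Assemble the custom retry/fallback attributes for one span.

    reason: rate_limit | server_error | validation_failure | timeout
    fallback_level: 0 = primary, 1 = first fallback, 2 = cache, 3 = degraded
    """
    attrs = {
        "llm.retry.attempt": attempt,
        "llm.retry.reason": reason,
        "llm.fallback.level": fallback_level,
    }
    if trigger is not None:
        attrs["llm.fallback.trigger"] = trigger
    return attrs
```

Centralizing the attribute names in one function is what makes the dashboards and alerts built on them trustworthy; typo'd attribute keys fail silently in most backends.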

These attributes transform your observability from "did this request succeed?" to "which states did this request pass through, how long did it spend in each, and which failure mode triggered each transition?"

The most useful derived metric from this instrumentation: retry latency as a fraction of total request latency. If 40% of your p95 latency is retry wait time, the problem is provider reliability, not model latency. If 15% is validation repair cycles, the problem is prompt or schema design. These are different problems that look identical in a total duration histogram.

Practical Alerting Thresholds for State-Based Monitoring

Once state transitions are instrumented, you can alert on meaningful signals instead of lagging indicators:

Retry rate by reason — Alert when retry rate exceeds 5% on any request class. A sustained validation_failure retry rate above 1% per prompt template is a signal that the prompt or schema needs redesign, not that you need to add more retries.

Fallback activation rate — Alert when fallback activation exceeds 2%. At that rate, your primary provider's reliability is meaningfully impacting users, and circuit breaker configuration needs review.

Circuit breaker state changes — Alert immediately when a circuit breaker transitions to OPEN. The lag between a provider starting to degrade and your circuit breaker opening is the window during which users experience failures. Tracking when and why circuit breakers open is the fastest path to reducing that window.
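A minimal circuit breaker is small enough to sketch directly. The failure threshold and cooldown below are illustrative, and the injectable clock exists only to make the state transitions testable:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED/OPEN/HALF_OPEN breaker; thresholds are illustrative."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"  # let one probe request through
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.state, self.failures = "CLOSED", 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"  # this transition is the alertable event
            self.opened_at = self.clock()
```

The transition into OPEN is exactly the signal to page on: it marks the moment the system decided a provider is degrading, before every subsequent request has to fail to prove it.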

finish_reason: "length" frequency — Alert when this exceeds 0.5% of responses. At that rate, you have a systemic token budget misconfiguration that is silently truncating a meaningful fraction of user-facing responses.

Escalation rate — Alert when requests reach ESCALATING at more than 0.1% of total volume. Escalation to human review or degraded response is a last resort; it should be rare. A rising escalation rate is the earliest signal that your retry and fallback configuration is undersized for current provider reliability.

The observability wins from state-machine tracing extend beyond alerting. With spans at each state transition, you can measure retry latency separately from generation latency, track which fallback tier is contributing the most to your monthly token bill, and identify which prompt templates consistently require validation repair cycles. These signals are invisible to monitoring that treats the LLM call as an opaque function.

Closing the Gap

The gap between "our API calls are succeeding" and "our product is working" in LLM applications is larger than in most software systems, because so many failure modes produce valid-looking outputs. Closing that gap requires three things: a model of the lifecycle that names every state, instrumentation that emits a span at every state transition, and alerts that fire on state-level signals before they aggregate into user-visible failures.

The state machine does not need to be complex. Even a minimal version — distinguishing CALLING_PRIMARY from RETRYING from CALLING_FALLBACK from DEGRADED_RESPONSE — is dramatically more observable than a single try/catch. The cost of adding these states is one span per transition. The benefit is knowing, for every request that "succeeded," what it actually went through to get there.
