The LLM Request Lifecycle Is a State Machine — Treat It Like One

9 min read
Tian Pan
Software Engineer

Most teams treat LLM request handling as a linear function: call the API, check for an exception, maybe retry once, return the result. In practice it's nothing like that. Between the moment a user triggers an LLM call and the moment a response reaches their screen, a request can traverse a dozen implicit states — attempting primary provider, waiting for backoff, switching to fallback, validating output, retrying with refined prompt — without any of those transitions being recorded or visible.

The result is debugging that happens after the fact from logs scattered across services, with no authoritative answer to "what did this request actually do?" Treating the LLM request lifecycle as an explicit finite state machine is the architectural move that makes that question answerable without archaeological work.

Why Implicit States Are a Production Liability

Consider a typical LLM call wrapper in production. It probably has:

  • A retry loop with exponential backoff for transient errors
  • A fallback to a secondary model when the primary returns a 503 or hits a rate limit
  • A validation step that checks whether the output matches the expected schema
  • A re-prompt path that rewrites the request when validation fails

Each of these is a state. Each transition between them is an event. But most implementations encode all of this as nested if-else logic inside a single function, with ad hoc logging that captures some transitions and misses others.

The practical consequences show up in three recurring failure modes:

Silent fallbacks. The primary model fails, the system routes to the secondary, and the user gets a response — but the response has subtly worse quality, higher latency, or higher cost than usual. Because no one logged the fallback transition explicitly, this pattern is invisible in dashboards until someone notices something wrong downstream.

Retry storms in disguise. When a provider experiences degraded performance rather than hard errors, retries succeed on the third or fourth attempt and appear as "slow requests" rather than "failures." The system looks fine from the outside. But aggregate retry costs accumulate, and the 90th percentile latency increases silently over weeks.

Validation failures misclassified as successes. The API returns HTTP 200 with a response that fails schema validation. The wrapper re-prompts once, gets a valid response, and returns it. The caller sees a success. The fact that this request required two LLM calls instead of one, took twice as long, and cost twice as much is unrecorded.

All three of these failures have the same root cause: the intermediate states are invisible.

Mapping the Actual States

A useful model defines eight states for any LLM request:

  1. PENDING — the request is queued but hasn't been dispatched
  2. DISPATCHED — the request has been sent to a specific provider/model
  3. AWAITING_RESPONSE — waiting for the stream to begin or the response to arrive
  4. VALIDATING — the response arrived; output is being checked
  5. RETRYING — the previous attempt failed; backoff timer is running
  6. FALLING_BACK — the primary path is considered unavailable; routing to secondary
  7. CIRCUIT_OPEN — the circuit breaker has tripped; requests fail fast without attempting the provider
  8. TERMINAL — the request reached a final outcome (SUCCESS, VALIDATION_FAILURE, EXHAUSTED, or DEGRADED)
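The eight states can be written down directly. A minimal sketch in Python — the transition table here is an illustrative assumption about which moves are legal, not a definitive specification:

```python
from enum import Enum, auto

class RequestState(Enum):
    PENDING = auto()
    DISPATCHED = auto()
    AWAITING_RESPONSE = auto()
    VALIDATING = auto()
    RETRYING = auto()
    FALLING_BACK = auto()
    CIRCUIT_OPEN = auto()
    TERMINAL = auto()

class TerminalOutcome(Enum):
    SUCCESS = auto()
    VALIDATION_FAILURE = auto()
    EXHAUSTED = auto()
    DEGRADED = auto()

# One plausible transition table. Encoding it explicitly means an
# illegal move raises at the moment it happens, instead of being
# silently absorbed by nested if-else logic.
LEGAL_TRANSITIONS = {
    RequestState.PENDING: {RequestState.DISPATCHED, RequestState.CIRCUIT_OPEN},
    RequestState.DISPATCHED: {RequestState.AWAITING_RESPONSE},
    RequestState.AWAITING_RESPONSE: {RequestState.VALIDATING, RequestState.RETRYING,
                                     RequestState.FALLING_BACK, RequestState.TERMINAL},
    RequestState.VALIDATING: {RequestState.TERMINAL, RequestState.RETRYING,
                              RequestState.FALLING_BACK},
    RequestState.RETRYING: {RequestState.DISPATCHED, RequestState.FALLING_BACK,
                            RequestState.TERMINAL},
    RequestState.FALLING_BACK: {RequestState.DISPATCHED, RequestState.CIRCUIT_OPEN,
                                RequestState.TERMINAL},
    RequestState.CIRCUIT_OPEN: {RequestState.FALLING_BACK, RequestState.TERMINAL},
    RequestState.TERMINAL: set(),
}

def can_transition(src: RequestState, dst: RequestState) -> bool:
    return dst in LEGAL_TRANSITIONS[src]
```

Note the one structural guarantee the table encodes: `TERMINAL` has no outgoing transitions, so every request that reaches it is genuinely done.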

The transitions between these states are events, not log messages. Each transition has a cause (what triggered it), a target state (where we're going), and a cost (latency added, tokens consumed, provider change).

What makes this machine explicit rather than implicit is recording every state entry and exit as a structured event, not as a side effect of log.info scattered through the retry loop.

Separating Three Distinct Failure Categories

One reason implicit implementations get complicated is that engineers conflate three failure types that actually require different handling:

Transient infrastructure failures are short-lived and self-resolving: rate limit 429s, brief 503s, network timeouts, TLS handshakes that time out under load. The correct response is to wait and retry the same provider. The wait time should start at one to two seconds, double with each attempt, and include random jitter to avoid thundering herds. A reasonable cap is three to five retries before escalating.
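A sketch of that retry policy, assuming full jitter (draw uniformly up to the exponential ceiling) and an injectable sleep so the policy is testable; the helper names are hypothetical:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: grow the ceiling as
    base * 2**attempt, cap it, then draw uniformly so concurrent
    clients don't retry in lockstep (the thundering-herd problem)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def with_retries(call, is_transient, max_retries: int = 4, sleep=time.sleep):
    """Retry `call` against the same provider for transient failures only;
    anything else propagates immediately to the fallback/validation logic."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            if attempt == max_retries or not is_transient(exc):
                raise
            sleep(backoff_delay(attempt))
```

The `is_transient` predicate is the important design choice: it keeps the classification decision (is this a 429 or a schema failure?) out of the retry mechanism itself.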

Provider degradation is the failure mode that retries cannot solve and circuit breakers exist to address. It shows up as rising timeout rates before error rates increase — P99 latency spikes while P50 stays normal, or a gradual rise in requests that time out at the 30-second hard limit. Continuing to send requests to a degrading provider wastes budget and worsens latency without improving success rates. The circuit breaker tracks failure rate over a rolling window (a common threshold is 50% failures over 60 seconds) and, when the threshold is exceeded, moves the provider to CIRCUIT_OPEN state. New requests skip the provider entirely until a probe request confirms recovery.
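A rolling-window breaker of this kind can be sketched in a few lines; the thresholds mirror the ones above, and the `min_samples` guard (don't trip on thin evidence) is an assumption of this sketch:

```python
import time
from collections import deque

class CircuitBreaker:
    """Tracks per-provider outcomes over a rolling window and opens
    when the failure rate crosses a threshold (here, 50% over 60s)."""

    def __init__(self, failure_threshold=0.5, window_seconds=60.0,
                 min_samples=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window_seconds
        self.min_samples = min_samples
        self.clock = clock          # injectable for testing
        self.events = deque()       # (timestamp, succeeded)

    def record(self, succeeded: bool) -> None:
        self.events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self) -> None:
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def is_open(self) -> bool:
        self._evict()
        if len(self.events) < self.min_samples:
            return False  # not enough evidence to fail fast
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) >= self.failure_threshold
```

A production version would also need the half-open probe state described later — periodically letting one request through to test recovery — which is omitted here for brevity.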

Quality failures are the ones that get missed most often. They occur when the provider returns HTTP 200 but the output fails validation — wrong schema, truncated content, hallucinated values in constrained fields, empty required arrays. Retrying the same prompt against the same model rarely helps here; the failure is deterministic for that prompt. The correct response is to move to a re-prompt variant (a semantically equivalent but differently structured request) or to escalate to a more capable model. Both paths need to be represented as explicit transitions, not as an undocumented second call hidden inside the validation function.
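The three categories can be made explicit with a small per-request classifier. A sketch, with illustrative status codes and field names — note that degradation deliberately does not appear here, because it is an aggregate signal:

```python
from typing import Optional

# Status codes treated as transient; an assumption of this sketch.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}

def classify_failure(status_code: Optional[int], timed_out: bool,
                     schema_valid: Optional[bool]) -> str:
    """Per-request classification. Provider degradation is absent on
    purpose: it shows up as rising timeout rates across many requests
    and is detected by the circuit breaker, not by any single call."""
    if status_code == 200 and schema_valid is False:
        return "quality"      # deterministic for this prompt: re-prompt variant or escalate
    if timed_out or status_code in TRANSIENT_STATUS:
        return "transient"    # backoff with jitter, retry the same provider
    return "unclassified"     # surface for triage rather than guessing
```

Keeping this decision in one pure function means the "what kind of failure was this?" question has exactly one answer per request, which is what makes the resulting transitions trustworthy.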

Making Every Transition Observable

An explicit state machine produces a trace for every request that answers the following questions without requiring log archaeology:

  • Which provider was attempted first, and in what order did subsequent attempts occur?
  • How many retry attempts were made before success or escalation?
  • At what latency did each attempt time out or complete?
  • Did the request require re-prompting due to validation failure? How many times?
  • Was the circuit breaker open for this provider at the time of routing?
  • What was the terminal outcome and what caused it?

This trace is a structured object, not a series of log lines. It has a request ID, a sequence of state transitions each with a timestamp and a reason code, and a terminal state with outcome metadata (tokens consumed, total latency, providers attempted, fallback depth).
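One plausible shape for that trace object, as a sketch (field names are illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Transition:
    state: str        # e.g. "FALLING_BACK"
    reason: str       # machine-readable cause code, e.g. "http_503"
    timestamp: float

@dataclass
class RequestTrace:
    request_id: str
    transitions: list = field(default_factory=list)
    terminal_state: Optional[str] = None
    outcome: dict = field(default_factory=dict)  # tokens, latency, providers attempted

    def record(self, state: str, reason: str, timestamp: float) -> None:
        self.transitions.append(Transition(state, reason, timestamp))

    def fallback_depth(self) -> int:
        # How far down the fallback chain this request travelled.
        return sum(1 for t in self.transitions if t.state == "FALLING_BACK")
```

Because the trace is data rather than log text, aggregate questions — fallback rate, retry distribution, re-prompt counts per template — become group-bys over a table instead of regex passes over log files.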

The observability value is cumulative. A single trace tells you what happened to one request. Aggregating traces tells you what's happening to the system. Rising fallback rates indicate primary provider degradation before errors surface in dashboards. Rising average fallback depth (how far down the fallback chain requests travel before succeeding) indicates systemic pressure. A distribution of re-prompt counts shows which prompt templates have structural output quality issues.

None of this is visible in a system where retry and fallback are implicit.

The Implementation Principle: Separate Execution from Observation

A practical trap when building this is coupling the state recording logic to the execution logic. If you add trace.add_event("retrying") inside the retry loop, you've made the observability code harder to test, harder to replace, and entangled with the business logic it's meant to observe.

The cleaner approach is to implement the state machine as a separate layer that drives the execution but doesn't perform it. The machine knows which state the request is in and what transitions are legal. The execution layer calls the provider. The machine records the transition and decides the next state based on the outcome.
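A minimal sketch of that separation: a pure decision function that knows only states and outcomes, and a driver that performs the side effects. State names, outcome strings, and the retry budget are all illustrative assumptions:

```python
def next_state(state: str, outcome: str, attempt: int, max_retries: int = 3) -> str:
    """Pure decision function: given the current state and what just
    happened, return the next state. No I/O, so trivially unit-testable."""
    if state == "DISPATCHED":
        return "AWAITING_RESPONSE"
    if state == "AWAITING_RESPONSE":
        if outcome == "response_received":
            return "VALIDATING"
        if outcome == "transient_error" and attempt < max_retries:
            return "RETRYING"
        return "FALLING_BACK"
    if state == "VALIDATING":
        if outcome == "valid":
            return "TERMINAL"
        return "RETRYING" if attempt < max_retries else "TERMINAL"
    if state in ("RETRYING", "FALLING_BACK"):
        return "DISPATCHED"
    raise ValueError(f"unexpected state: {state}")

def run_request(execute, observe, max_retries: int = 3) -> str:
    """Driver: `execute(state)` performs the side effect for that state
    (provider call, backoff sleep, validation) and returns an outcome
    string; `observe` records every transition without ever touching
    execution. A real driver would also budget fallback depth before
    declaring the request EXHAUSTED."""
    state, attempt = "DISPATCHED", 0
    while state != "TERMINAL":
        outcome = execute(state)
        if state == "RETRYING":
            attempt += 1
        new = next_state(state, outcome, attempt, max_retries)
        observe(state, new, outcome)
        state = new
    return state
```

Swapping retry budgets per request type then means passing a different `max_retries` (or policy object) to the same machine — the instrumentation in `observe` never changes.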

This separation also makes the retry policy configurable independently of the observation policy, which matters when you want different retry budgets for different request types (interactive user-facing queries versus background batch jobs) without duplicating the instrumentation.

One implication worth noting from production experience: synchronous logging inside the request path is dangerous. Writing each state transition to a database inline with the request can double p99 latency before anyone notices the cause. Observability events should be emitted asynchronously to a queue or stream and processed off the critical path.
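The async emission pattern can be sketched with an in-memory queue and a background drain thread (a real system would likely hand events to a stream like Kafka instead; `sink` and the class name are hypothetical):

```python
import queue
import threading

class AsyncEventEmitter:
    """Transition events go onto a queue; a background thread drains
    them into `sink`, keeping persistence off the request's critical path."""

    def __init__(self, sink):
        self.q = queue.Queue()
        self.sink = sink  # e.g. writes to a log stream or database
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def emit(self, event: dict) -> None:
        self.q.put(event)  # O(1); never blocks on the sink's latency

    def _drain(self) -> None:
        while True:
            event = self.q.get()
            if event is None:      # shutdown sentinel
                break
            self.sink(event)
            self.q.task_done()

    def close(self) -> None:
        self.q.put(None)
        self.worker.join()
```

The trade-off is the usual one for async observability: events can be lost on a crash, which is acceptable for traces in a way it would not be for the responses themselves.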

What You Actually Gain

Teams that implement this pattern report the same benefits, consistently and in the same order:

First, mean time to debug drops significantly. When a request behaves unexpectedly, the trace shows exactly which states it traversed and why, without guesswork. The question "why was this response late and low quality?" gets answered in seconds rather than requiring a cross-service log join.

Second, cost attribution becomes possible. Because every LLM call is tagged with its cause (first attempt, retry N, fallback at depth M, re-prompt due to validation failure), token spend can be decomposed. Teams typically find that five to fifteen percent of total token costs are retry and re-prompt overhead — overhead that was previously invisible and therefore unoptimizable.

Third, provider reliability becomes measurable per request type. A model that has high error rates on short creative completions but excellent reliability on structured extraction tasks will show this pattern in aggregated traces, enabling routing decisions grounded in empirical data rather than vendor SLA numbers.

The state machine framing forces one design decision that has outsized value: every request must have a terminal state, and every terminal state must have a reason. Systems that fail silently or loop indefinitely cannot exist in a model where every request must exit through a defined terminal outcome. That constraint alone eliminates entire categories of production bugs — the kind where requests never quite fail but also never quite succeed, consuming resources while producing nothing.

The Boring Reality of Resilience

The actual implementation — a state enum, a transition table, a structured event emitter, and a trace aggregator — is not complex. The patterns are established. Exponential backoff with jitter, circuit breakers with half-open probe states, hard timeouts, fallback chains, re-prompt on validation failure: none of these are novel ideas.

What's consistently missing isn't the mechanism but the model. Engineers implement retry logic and call it done. They don't ask "what state is this request in?" because they didn't define the states. The result is resilience that works under happy-path conditions but becomes undebuggable at the specific moment it matters most — when something is going wrong in production and you need to understand why.

Treat the request lifecycle as a state machine. Every state has a name. Every transition has a reason. Every request has a terminal outcome. That discipline is the difference between resilience you can reason about and resilience that only works until you need to explain it.
