
AI-Native API Design: Why REST Breaks When Your Backend Thinks Probabilistically

11 min read
Tian Pan
Software Engineer

Most backend engineers can recite the REST contract from memory: client sends a request, server processes it, server returns a status code and body. A 200 means success. A 4xx means the client did something wrong. A 5xx means the server broke. The response is deterministic, the timeout is predictable, and idempotency keys guarantee safe retries.

LLM backends violate every one of those assumptions. A 200 OK can mean your model hallucinated the entire response. A successful request can take twelve minutes instead of twelve milliseconds. Two identical requests with identical parameters will return different results. And if your server times out mid-inference, you have no idea whether the model finished or not.

Teams that bolt LLMs onto conventional REST APIs end up with a graveyard of hacks: timeouts that kill live agent tasks, clients that treat hallucinated 200s as success, retry logic that charges a user's credit card three times because idempotency keys weren't designed for probabilistic operations. This post walks through where the mismatch bites hardest and describes the interface patterns that actually hold up in production.

The Synchronous Request-Response Model Was Built for Speed

REST was designed for fast, stateless operations. A database query completes in milliseconds. A file upload takes seconds at most. HTTP's default timeout of 30 seconds is generous for deterministic workloads.

LLM inference doesn't fit this model in two distinct ways. First, even simple text generation is slow by conventional standards — generating a 500-token response at 30 tokens per second takes 16 seconds. Second, agentic tasks that chain tool calls together can run for minutes or hours. An agent that researches a topic, writes code, runs tests, and iterates on failures might need 20 minutes of wall-clock time.

When a client's HTTP timeout fires at 30 seconds, it doesn't know whether the task was abandoned or is still running server-side. The connection drops, but the model keeps going. The client retries. Now two instances of the same agent task are running simultaneously, both potentially writing to the same database, calling the same external APIs, and sending duplicate emails.

The fix is an async job pattern that the major LLM APIs have converged on independently. The initial request returns immediately with 202 Accepted and a job ID. The client then polls a status endpoint or opens a streaming connection for updates. The job runs to completion regardless of client connectivity. This decoupling is the single most important structural change between a conventional REST API and one built for long-running AI workloads.
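The decoupling can be sketched in a few lines. This is a minimal in-process illustration, not any provider's API: the names `JobStore`, `submit`, and `status` are assumptions, and a real service would persist jobs in a database rather than a dict.

```python
import threading
import time
import uuid

class JobStore:
    """Minimal async-job sketch: submission returns a job ID immediately;
    the job runs to completion regardless of client connectivity."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, task):
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None}
        # The worker thread keeps running even if the client disconnects.
        threading.Thread(target=self._run, args=(job_id, task), daemon=True).start()
        return {"job_id": job_id, "status": "accepted"}  # maps to 202 Accepted

    def _run(self, job_id, task):
        result = task()  # long-running inference happens here
        with self._lock:
            self._jobs[job_id] = {"status": "completed", "result": result}

    def status(self, job_id):
        with self._lock:
            return dict(self._jobs[job_id])

store = JobStore()
receipt = store.submit(lambda: (time.sleep(0.1), "done")[1])
while store.status(receipt["job_id"])["status"] == "running":
    time.sleep(0.05)  # client polls; server-side work is unaffected
print(store.status(receipt["job_id"]))  # → {'status': 'completed', 'result': 'done'}
```

The same shape maps directly onto HTTP: `submit` backs a `POST` endpoint returning 202, and `status` backs a `GET` endpoint the client polls.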

Status Codes Can't Capture Semantic Failures

HTTP status codes communicate infrastructure outcomes, not semantic ones. When a server returns 200 OK, it means the request was processed without a transport error. It says nothing about whether the content is correct.

This distinction didn't matter much for deterministic backends. If your API returns user data, the data is either there (200) or it isn't (404). But LLM backends can return syntactically correct responses that are semantically broken in ways HTTP has no vocabulary for.

Consider the failure modes that all return 200 OK:

  • Hallucination: The model invents API parameters, method names, or facts that don't exist. The JSON parses correctly. The schema validates. The data is entirely fabricated.
  • Refusal: The model declines to answer. You get back a polite message like "I can't help with that" instead of the structured output your application expected.
  • Schema drift: The model returns valid JSON, but uses snake_case where your schema expects camelCase, or omits a required field that it decided wasn't important.
  • Truncation: The model ran out of tokens mid-response. You get back valid JSON up to the point where it was cut off, then garbage or an abrupt end.

Clients that check only the HTTP status code miss all of these. The downstream effect is that applications treat hallucinated responses as ground truth, try to parse truncated JSON, and crash on missing fields — all while reporting zero API errors because every request returned 200.
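Two of these failure modes are cheap to detect mechanically. A rough sketch, assuming a single required field for illustration (`unit_price` is a hypothetical name): truncation surfaces as a JSON parse failure, and schema drift surfaces as a missing required field.

```python
import json

def classify(raw: str, required_fields=("unit_price",)):
    """Classify a 200-OK body that HTTP status codes can't distinguish."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        # Truncated output usually fails to parse at the cut-off point.
        return "truncation_or_garbage"
    missing = [f for f in required_fields if f not in doc]
    return f"schema_drift: missing {missing}" if missing else "ok"

print(classify('{"unit_price": 9.99}'))   # → ok
print(classify('{"unit_price": 9.'))      # → truncation_or_garbage
print(classify('{"unitPrice": 9.99}'))    # → schema_drift: missing ['unit_price']
```

Hallucination and refusal are harder: they require semantic checks (grounding against known data, refusal-phrase detection) rather than structural ones.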

The pattern that addresses this is a structured error envelope returned alongside the HTTP status code. Rather than relying on the status code alone, the response body carries a semantic status field:

{
  "status": "partial_success",
  "result": { ... },
  "errors": [
    {
      "type": "schema_violation",
      "message": "Field 'unit_price' missing from line_items[2]",
      "severity": "warning",
      "recovery_suggested": true
    }
  ]
}

HTTP 200 means the request was processed. The status field in the body tells you whether the output is usable. This pattern lets clients make fine-grained decisions: log warnings for minor drift, retry on schema violations, escalate to humans on hallucination signals. It also makes your API honest about what it actually guarantees: infrastructure delivery, not semantic correctness.
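Client-side, the envelope drives a small decision function. This is a sketch against the example envelope above; the `status` values and severity taxonomy beyond what the example shows are assumptions, not a standard.

```python
def handle_response(body: dict) -> str:
    """Map a semantic-status envelope to a client action."""
    status = body.get("status", "success")
    if status == "success":
        return "use_result"
    if status == "partial_success":
        # Hard schema violations trigger a retry; warnings pass through.
        if any(e["type"] == "schema_violation" and e["severity"] == "error"
               for e in body.get("errors", [])):
            return "retry"
        return "use_result_with_warnings"
    if status == "refusal":
        return "escalate_to_human"
    return "retry"

resp = {
    "status": "partial_success",
    "result": {"line_items": []},
    "errors": [{"type": "schema_violation",
                "message": "Field 'unit_price' missing from line_items[2]",
                "severity": "warning", "recovery_suggested": True}],
}
print(handle_response(resp))  # → use_result_with_warnings
```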

Streaming Is Not Optional — But the Protocol Choice Matters

For text generation, streaming is the difference between an application that feels alive and one that looks frozen. Users can start reading at 200 milliseconds instead of waiting 16 seconds for the full completion. For long agentic tasks, streaming status updates is the only way to give users visibility into what's happening without requiring them to poll repeatedly.

The industry has landed on two protocols, each suited to different cases.

Server-Sent Events (SSE) is the default for token streaming. Every major LLM provider — OpenAI, Anthropic, Google — uses SSE for their streaming APIs. It runs over standard HTTP, works through load balancers and proxies without special configuration, and the browser's EventSource API handles reconnection automatically. Each chunk arrives as a structured event:

data: {"delta": {"type": "text_delta", "text": "The"}}
data: {"delta": {"type": "text_delta", "text": " answer"}}
data: [DONE]

SSE is unidirectional — server pushes to client only. That's sufficient for read-only use cases: chat, summaries, code generation, search results.
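Consuming the stream above is mostly line parsing. A minimal sketch for the chunk shape shown: real clients also need multi-line `data:` fields, comment lines, and reconnection handling, all of which this skips.

```python
import json

def parse_sse(raw: str) -> str:
    """Accumulate text deltas from an SSE stream, one data: line per event."""
    text = []
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # ignore event:, id:, and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # stream terminator used by the example above
        event = json.loads(payload)
        if event["delta"]["type"] == "text_delta":
            text.append(event["delta"]["text"])
    return "".join(text)

stream = (
    'data: {"delta": {"type": "text_delta", "text": "The"}}\n'
    'data: {"delta": {"type": "text_delta", "text": " answer"}}\n'
    "data: [DONE]\n"
)
print(parse_sse(stream))  # → The answer
```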

WebSocket adds bidirectional communication at the cost of more complex connection state management. Use it when the client needs to interrupt or modify an in-progress generation — for example, a voice interface where a user starts talking mid-response, or a collaborative editor where multiple users can redirect the model's output. Most teams default to SSE and reach for WebSocket only when they genuinely need client-to-server mid-stream messages.

The operational risk with SSE is connection management under failures. Proxies and CDNs that buffer responses will eat your stream. Load balancers that aggressively time out idle connections will cut streams during slow reasoning phases. Clients that don't implement reconnection will silently drop the connection and show a blank UI. These are engineering problems, not protocol problems — but they bite every team that ships streaming for the first time.

Idempotency Keys Were Designed for the Wrong Threat Model

The classic idempotency guarantee: if you send request X with idempotency key K, the server processes it once. If K arrives again, the server returns the cached result from the first execution. This is how payment APIs prevent double-charges when network failures cause retries.

The guarantee requires determinism. The same request must produce the same result every time, so the cached result is indistinguishable from a fresh execution. LLMs break this. Temperature > 0 means two identical prompts produce different text. Temperature = 0 with a fixed seed gets you most of the way there, but GPU floating-point non-determinism and load-balancing effects mean you still can't guarantee byte-identical outputs across runs.

This creates a real problem for agentic systems that perform state-changing tool calls. An agent that calls a "send email" tool three times (initial call plus two retries after network failures) might send three emails if the tool isn't separately guarded. The idempotency key on the LLM request doesn't prevent the tool call from executing multiple times.

The pattern that works is applying idempotency at the tool layer, not the LLM layer. Each tool call that performs a write operation gets its own idempotency key, derived from the agent task ID and a call-site hash. The tool executor deduplicates at the execution level before touching external systems. The LLM layer can generate non-deterministically and retry freely; the tool layer absorbs the idempotency guarantee for state-changing operations.

For read-only tool calls, idempotency is less critical — retry them freely. For write operations (database mutations, emails, payments, API calls with side effects), enforce a strict "execute once" guarantee at the tool layer regardless of how many times the LLM requests the same call.
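The tool-layer guarantee can be sketched as an executor that derives a key from the task ID and call arguments and deduplicates before executing. `ToolExecutor` and its method names are illustrative, and the key derivation here (hashing sorted arguments) is one possible call-site hash, not the only one.

```python
import hashlib

class ToolExecutor:
    """Enforces execute-once semantics for write-capable tool calls."""

    def __init__(self):
        self._executed = {}   # idempotency key -> cached result
        self.side_effects = 0

    def _key(self, task_id: str, tool: str, args: dict) -> str:
        blob = f"{task_id}:{tool}:{sorted(args.items())}"
        return hashlib.sha256(blob.encode()).hexdigest()

    def call(self, task_id, tool, args, fn, is_write):
        if not is_write:
            return fn(**args)  # read-only calls retry freely
        key = self._key(task_id, tool, args)
        if key in self._executed:
            return self._executed[key]  # dedupe: return cached result
        result = fn(**args)
        self.side_effects += 1  # the external system is touched exactly once
        self._executed[key] = result
        return result

ex = ToolExecutor()
send = lambda to: f"sent to {to}"
for _ in range(3):  # LLM retries the same call after network failures
    ex.call("task-42", "send_email", {"to": "a@b.c"}, send, is_write=True)
print(ex.side_effects)  # → 1
```

In production the key store lives in a shared database with a TTL, not process memory, so deduplication survives executor restarts.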

Long-Running Tasks Require a Different API Shape

The async job pattern decouples submission from execution, but the client-facing API still needs a coherent shape for the three points in a long task's lifecycle: submission, status, and completion.

Submission returns immediately. POST /tasks accepts the task description and any configuration, then returns 202 Accepted with a task ID and an estimated duration if available. The client now has a handle and can disconnect — the task will run to completion.

Status has three viable options depending on your use case:

  • Polling (GET /tasks/{id}) is the simplest. Clients poll on a schedule and receive {"status":"running","progress":0.4} until the task finishes. Add a Retry-After header to suggest polling intervals and reduce unnecessary requests. Good for batch use cases where real-time updates aren't needed.
  • Webhooks let the server notify the client when status changes, eliminating polling overhead. The client provides a callback URL at submission time; the server calls it at each significant state transition. Good for server-to-server integrations where the client can expose a public endpoint.
  • SSE status stream (GET /tasks/{id}/stream) is the best experience for interactive clients. The client opens a persistent connection after submission and receives real-time updates as the agent progresses through steps. This combines the immediacy of webhooks with the simplicity of client-initiated connections.

Completion should include the full result, the final status, and enough metadata to diagnose failures: which steps ran, which tools were called, how long each took, and what errors occurred. A task that "failed" needs enough context in the response for the calling system to decide whether to retry, escalate, or surface an error to the user.
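The polling option can be sketched as a client loop that honors the server's suggested interval. `fetch_status` is a stand-in for the real HTTP call to `GET /tasks/{id}`, returning the body plus the Retry-After hint; the fake server below exists only to make the sketch runnable.

```python
import time

def poll_until_done(task_id, fetch_status, max_wait=60.0):
    """Poll a task until it reaches a terminal state, honoring Retry-After."""
    waited = 0.0
    while waited < max_wait:
        body, retry_after = fetch_status(task_id)
        if body["status"] in ("completed", "failed"):
            return body
        interval = retry_after or 1.0  # server-suggested interval, with fallback
        time.sleep(interval)
        waited += interval
    raise TimeoutError(f"task {task_id} still running after {max_wait}s")

# Fake server for illustration: reports running twice, then completed.
calls = {"n": 0}
def fetch_status(task_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"status": "running", "progress": calls["n"] * 0.4}, 0.01
    return {"status": "completed", "result": "ok"}, None

print(poll_until_done("t1", fetch_status))  # → {'status': 'completed', 'result': 'ok'}
```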

Versioning When Your Output Distribution Is What Changed

Versioning a deterministic API is straightforward: change the interface, bump the version, maintain backward compatibility. Clients pin to a version and stay there until they're ready to migrate.

LLM APIs add a new versioning problem: the output distribution of a given model changes over time even when the interface stays the same. Providers retrain and update models without always issuing a new version identifier. Your prompts, which were tuned against a specific model behavior, now run against a subtly different model.

The practical implication is that version pinning must go deeper than the API version. Pin to specific model checkpoints when they're available (providers like Anthropic and OpenAI expose dated model identifiers for this reason). Run behavioral regression tests against pinned checkpoints before migrating to a newer one. The regression tests aren't binary pass/fail — they're statistical thresholds: JSON validity rate, schema adherence rate, semantic quality score. A model that hallucinates 2% of the time is not the same as one that hallucinates 8% of the time, and the difference won't show up in unit tests.

For SLA commitments, this means shifting from "API returns valid JSON" to "API returns valid JSON ≥ 97% of the time across a 24-hour measurement window." This is uncomfortable for engineers used to deterministic guarantees, but it's honest. Statistical SLAs are the only commitments you can actually stand behind for probabilistic backends.
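A statistical regression gate is a small piece of code to write. This sketch assumes per-sample boolean quality signals and illustrative threshold values; the metric names and the `regression_gate` function are not from any particular test framework.

```python
def regression_gate(outcomes, thresholds):
    """Compare observed pass rates against per-metric thresholds.

    outcomes: list of dicts with boolean quality signals per sample.
    thresholds: metric name -> minimum acceptable rate.
    """
    n = len(outcomes)
    report = {}
    passed = True
    for metric, required in thresholds.items():
        rate = sum(1 for o in outcomes if o[metric]) / n
        report[metric] = rate
        if rate < required:
            passed = False  # a single metric below threshold blocks migration
    return passed, report

# 97 of 100 eval samples produce valid, schema-conforming JSON.
samples = [{"json_valid": True, "schema_ok": True}] * 97 + \
          [{"json_valid": False, "schema_ok": False}] * 3
ok, report = regression_gate(samples, {"json_valid": 0.97, "schema_ok": 0.95})
print(ok, report)  # → True {'json_valid': 0.97, 'schema_ok': 0.97}
```

The same function doubles as an SLA monitor: feed it a 24-hour window of production samples instead of an eval set.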

Design for Uncertainty from the First Interface

The teams that get AI-native API design right don't treat it as a patch on top of REST conventions. They start from the failure modes and work backward to the interface.

Long-running tasks require async submission from day one, because retrofitting polling onto a synchronous endpoint is a breaking change. Semantic error envelopes need to be in the initial schema, because clients that aren't checking them won't start checking them after launch. Streaming requires infrastructure support (SSE-compatible load balancers, connection timeout configuration) that's much harder to add after deployment. Idempotency at the tool layer needs to be built before the first write operation goes to production.

The underlying mental model shift is this: REST APIs are designed around what the server knows it will deliver. AI-native APIs are designed around what the server cannot guarantee — and they make that uncertainty explicit in every part of the interface. Status envelopes, confidence metadata, probabilistic SLAs, and async patterns aren't complexity for its own sake. They're the contract a probabilistic backend can actually honor.
