Skip to main content

17 posts tagged with "api-design"

View all tags

Hyrum's Law for Streamed Reasoning: Pacing, Pauses, and Intermediate Tokens Are an Undocumented Contract

· 11 min read
Tian Pan
Software Engineer

A team upgrades from a frontier model to its faster successor. The eval suite is green. Final answers match. Tool-call schemas are identical. The structured outputs validate against the same JSON schema they always did. They ship. Within a day, support tickets pile up: "the assistant feels rushed," "it's not really thinking anymore," "something is off." The product manager pulls telemetry and finds task-completion rates unchanged. The engineering team double-checks the eval and the schema and finds nothing wrong. The complaint is real, but the contract — as the team defined it — is intact.

What changed is the texture of the stream. The old model paused for 800 milliseconds before calling a tool, emitted a "Let me check that..." preamble, and dribbled tokens at roughly 35 per second with natural-feeling clusters around clause boundaries. The new model emits tokens at 90 per second, never pauses, and skips the preamble entirely. None of that was in any documented contract. All of it was load-bearing.

This is Hyrum's law, and streaming makes its surface area enormous. Any observable behavior of your system will be depended on by somebody — and a streaming AI surface exposes far more observable behavior than the team realizes.

API Documentation Is Reliability Infrastructure: How Your Docs Determine Agent Success Rates

· 10 min read
Tian Pan
Software Engineer

Most engineering teams think of API documentation as a developer experience concern — something you improve to reduce support tickets and onboarding time. That framing made sense when your primary consumer was a human reading docs in a browser. It is no longer adequate.

When an AI agent calls your API via tool use, your documentation stops being a guide and becomes runtime behavior. A vague parameter description isn't a UX inconvenience — it is a direct instruction to the model that produces hallucinated values. A missing error code isn't a gap in your reference docs — it is an ambiguous signal that can send an agent into a retry loop with no exit condition. The documentation you wrote three years ago for a human audience is now being parsed by a stateless language model that will execute confidently regardless of whether it understood correctly.

Agent as User: Why Your Product Analytics Break When Bots Become Your Power Users

· 10 min read
Tian Pan
Software Engineer

Automated internet traffic grew 23.5% year-over-year in 2025 — eight times faster than human traffic. Agent-driven interactions alone grew 7,851%. If you're building a product that handles any meaningful volume of API traffic, there's a reasonable chance your heaviest "users" are not human. The uncomfortable truth is that your product analytics almost certainly have no idea.

This isn't a bot detection problem. It's an instrumentation architecture problem. When an AI agent books travel, files expense reports, queries your database, or calls your payment API, it leaves a completely different behavioral signature than a human doing the same thing — and your session funnels, NPS surveys, and cohort retention charts are quietly telling you lies.

AI-Native API Design: Building Backends That Agents Can Actually Use

· 10 min read
Tian Pan
Software Engineer

Your REST API works fine. Documentation is thorough. Error codes are consistent. Every human-authored client you've ever tested handles it well. Then your team integrates an AI agent and within an hour it's generated 2,000 failed requests by retrying variations of an endpoint that doesn't exist — bulk_search_users, search_all_users, bulk_user_search — each attempt triggering real downstream processing.

This isn't a prompt engineering failure. It's an API design failure.

REST APIs were built for clients that parse documentation, respect contracts, and call exactly what's specified. AI agents are different: they reason about what an endpoint probably does based on names and descriptions, retry without tracking state, and treat error messages as instructions rather than diagnostic codes. Designing an API for an agentic caller requires rethinking assumptions that most backend engineers have never had to question.

Stale Tool Descriptions Are Your Agent's Biggest Silent Failure

· 9 min read
Tian Pan
Software Engineer

You ship a tool that lets your agent fetch user profiles. The description reads: "Retrieves user information by user ID." Six weeks later, the backend team renames user_id to customer_uuid and adds a required tenant_id field. Nobody updates the tool schema. Your agent keeps calling the old signature, gets back a 400, interprets the empty result as "no user found," and helpfully creates a duplicate record.

No error in the logs. No alert fired. The agent was confident the whole time.

This is the tool documentation problem: schema drift that turns stale descriptions into silent failure vectors. It is probably the most underappreciated reliability hazard in production AI systems today, and it gets worse the longer your agent lives.

Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

· 9 min read
Tian Pan
Software Engineer

A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.

Agent Traffic Is Not Human Traffic: Designing APIs for Two Species of Caller

· 11 min read
Tian Pan
Software Engineer

The API you shipped two years ago was designed for a single species of caller: a person, behind a browser or a mobile client, clicking once and waiting for a response. That assumption is now wrong on roughly half of every interesting endpoint. The other half of the traffic is agents — your own, your customers', third-party integrations using your endpoints as tools — and they have different physics. They burst. They retry forever. They parallelize. They parse error strings literally. They act on behalf of a human who will not be available to clarify intent when something breaks.

Most of the production weirdness landing in postmortems this year traces back to one architectural mistake: treating both species as the same caller class. Rate limits sized for human pacing get blown apart by an agent's parallel fanout. Error messages designed to be human-readable get parsed wrong by an agent that retries forever on a 400. Idempotency assumptions that humans satisfy by default get violated when an agent retries the same payload from a recovered checkpoint. Auth logs lose the ability to distinguish "the user did this" from "the user's agent did this on the user's behalf."

The fix is not a smarter WAF or a bigger rate-limit bucket. It is a deliberate API design that names two caller classes, treats their traffic as different shapes, and records the delegation chain so accountability survives the indirection.

Conversation State Is Not a Chat Array: Multi-Turn Session Design for Production

· 10 min read
Tian Pan
Software Engineer

Most multi-turn LLM applications store conversation history as an array of messages. It works fine in demos. It breaks in production in ways that take days to diagnose because the failures look like model problems, not infrastructure problems.

A user disconnects mid-conversation and reconnects to a different server instance—session gone. An agent reaches turn 47 in a complex task and the payload quietly exceeds the context window—no error, just wrong answers. A product manager asks "can we let users try a different approach from step 3?"—and the engineering answer is "no, not with how we built this." These are not edge cases. They are the predictable consequences of treating conversation state as a transient array rather than a first-class resource.

The ORM Impedance Mismatch for AI Agents: Why Your Data Layer Is the Real Bottleneck

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks tuning prompts and evals, benchmarking model choices, and tweaking temperature — while their actual bottleneck sits one layer below: the data access layer that was designed for human developers, not agents.

The mismatch isn't subtle. ORMs like Hibernate, SQLAlchemy, and Prisma, combined with REST APIs that return paginated, single-entity responses, produce data access patterns exactly wrong for autonomous AI agents. The result is token waste, rate limit failures, cascading N+1 database queries, and agents that hallucinate simply because they can't afford to load the context they need.

This post is about the structural problem — and what an agent-optimized data layer actually looks like.

AI-Native API Design: Why REST Breaks When Your Backend Thinks Probabilistically

· 11 min read
Tian Pan
Software Engineer

Most backend engineers can recite the REST contract from memory: client sends a request, server processes it, server returns a status code and body. A 200 means success. A 4xx means the client did something wrong. A 5xx means the server broke. The response is deterministic, the timeout is predictable, and idempotency keys guarantee safe retries.

LLM backends violate every one of those assumptions. A 200 OK can mean your model hallucinated the entire response. A successful request can take twelve minutes instead of twelve milliseconds. Two identical requests with identical parameters will return different results. And if your server times out mid-inference, you have no idea whether the model finished or not.

Teams that bolt LLMs onto conventional REST APIs end up with a graveyard of hacks: timeouts that kill live agent tasks, clients that treat hallucinated 200s as success, retry logic that charges a user's credit card three times because idempotency keys weren't designed for probabilistic operations. This post walks through where the mismatch bites hardest and what the interface patterns that actually hold up in production look like.

API Design for AI-Powered Endpoints: Versioning the Unpredictable

· 8 min read
Tian Pan
Software Engineer

Your /v1/summarize endpoint worked perfectly for eighteen months. Then you upgraded the underlying model. The output format didn't change. The JSON schema was identical. But your downstream consumers started filing bugs: the summaries were "too casual," the bullet points were "weirdly specific," the refusals on edge cases were "different." Nothing broke in the traditional sense. Everything broke in the AI sense.

This is the versioning problem that REST and GraphQL were never designed to solve. Traditional API contracts assume determinism: the same input always produces the same output. An AI endpoint's contract is probabilistic — it includes tone, reasoning style, output length distribution, and refusal thresholds, all of which can drift when you swap or update the underlying model. The techniques that work for database-backed APIs are necessary but not sufficient for AI-backed ones.

Behavioral SLAs for AI-Powered APIs: Writing Contracts for Non-Deterministic Outputs

· 10 min read
Tian Pan
Software Engineer

Your payment service has a 99.9% uptime SLA. Requests either succeed or fail with a documented error code. When something breaks, you know exactly what broke.

Now imagine you've shipped a smart invoice-parsing API that wraps an LLM. One Monday morning, your largest customer calls: "Your API returned a valid JSON object, but the total_amount field is off by a factor of ten on invoices with foreign currencies." Your service returned HTTP 200. Your uptime dashboard is green. By every traditional SLA metric, you didn't break anything. But you absolutely broke something — and you have no contractual language to even describe what went wrong.

This is the gap at the center of most AI API deployments today. The contract that governs what your API promises was written for deterministic systems, and LLMs are not deterministic systems.