Provider batch APIs cut inference cost in half but reshape the engineering contract: job-level idempotency, freshness boundaries, late-result observability, and a tier-aware decision matrix that reroutes 30–50% of LLM spend on workloads the user was never waiting on.
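Job-level idempotency is the hinge of that contract: a retry of a batch submission must not create a second job. A minimal sketch, assuming a hypothetical `provider_submit` callable and an in-memory registry (a real system would persist the key in a database):

```python
import hashlib
import json

# Hypothetical in-memory registry; a real system persists this keyed
# on the idempotency key so retries survive process restarts.
_submitted: dict[str, str] = {}

def job_key(payload: dict) -> str:
    """Derive a deterministic idempotency key from the batch payload."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def submit_batch(payload: dict, provider_submit) -> str:
    """Submit once per unique payload; a retry returns the original job id."""
    key = job_key(payload)
    if key in _submitted:
        return _submitted[key]          # replay-safe: no duplicate job
    job_id = provider_submit(payload)   # assumed to return a provider job id
    _submitted[key] = job_id
    return job_id
```

The key is derived from the payload itself, so the caller never has to invent or store one.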
A hosted moderation API turns your safety control into a synchronous external dependency — the build-vs-buy decision, fail-open vs fail-closed tradeoff, and integration discipline that keeps a vendor on the safety-critical path from owning your incident response.
Every default in the LLM stack — pretraining, RLHF, judge LLMs, user feedback — pushes the model toward confident wrong answers. Calibrated abstention only ships if you build the eval, rubric, and UI that pay for it.
Stop is a UI affordance, not a system guarantee. A practitioner's playbook for cancel-safe agents: durable side-effect ledgers, scoped authorization, compensating actions, and what the cancel UI should actually display.
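The ledger-plus-compensation pattern can be sketched in a few lines — this is an illustrative in-memory version, not a durable store; names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SideEffectLedger:
    """Sketch of a side-effect ledger: append each committed effect with
    its compensating action, so a cancel can unwind in reverse order."""
    entries: list[tuple[str, Callable[[], None]]] = field(default_factory=list)

    def record(self, description: str, compensate: Callable[[], None]) -> None:
        self.entries.append((description, compensate))

    def cancel(self) -> list[str]:
        """Run compensating actions newest-first; return what was undone
        (this list is what the cancel UI should display)."""
        undone = []
        for description, compensate in reversed(self.entries):
            compensate()   # e.g. delete a created draft, revoke a grant
            undone.append(description)
        self.entries.clear()
        return undone
```

A durable version writes each `record` before the side effect commits, so a crash mid-cancel can resume the unwind.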
A field guide to attributing cost across compound AI systems — per-span ledgers, on-behalf-of headers, settlement-currency mismatches, and the political surface that decides who pays for tool calls.
Treating conversation history as scrollback is why agents lose the thread by turn 8 and why context bills scale superlinearly. The fix is to call it what it is — a read-heavy database — and design accordingly.
A single autonomy switch is the wrong abstraction for coding agents. Map each tool to a blast-radius tier, scale gates to the tier, and match agent velocity to your rollback velocity.
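The tier map replaces the single switch with a per-tool ceiling. A minimal sketch with hypothetical tool names:

```python
from enum import IntEnum

class BlastRadius(IntEnum):
    READ_ONLY = 0      # e.g. list files, grep
    REVERSIBLE = 1     # e.g. edit a file under version control
    IRREVERSIBLE = 2   # e.g. push, deploy, delete data

# Hypothetical tool-to-tier map; the names are illustrative.
TOOL_TIERS = {
    "read_file": BlastRadius.READ_ONLY,
    "edit_file": BlastRadius.REVERSIBLE,
    "deploy": BlastRadius.IRREVERSIBLE,
}

def requires_approval(tool: str, max_autonomous: BlastRadius) -> bool:
    """Gate scales with tier: anything above the autonomous ceiling
    needs a human in the loop."""
    return TOOL_TIERS[tool] > max_autonomous
```

Raising the ceiling for one environment (a sandboxed branch) without raising it for another (production) is the whole point of making this a map rather than a switch.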
When a vendor renames a tool response field, your agent doesn't crash — it adapts and ships a degraded answer. Why microservices contract testing has to migrate to the agent stack, and how to wire it in.
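A contract check that runs in CI is the difference between a failed build and a silently degraded answer. A minimal sketch, assuming hypothetical required fields on a tool response:

```python
# Hypothetical contract for one tool's response; a real setup would
# load this from a shared schema the vendor also tests against.
REQUIRED_FIELDS = {"status": str, "result": str}

def check_tool_contract(response: dict) -> list[str]:
    """Return a list of contract violations; empty means the response
    still matches the schema the agent's prompts were written against."""
    violations = []
    for name, ftype in REQUIRED_FIELDS.items():
        if name not in response:
            violations.append(f"missing field: {name}")
        elif not isinstance(response[name], ftype):
            violations.append(f"wrong type for {name}")
    return violations
```

Run it against recorded vendor responses on every deploy, and a renamed field fails loudly instead of being "adapted" around.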
Production LLM logs answer 'what did the model say' well and 'what did the model see' poorly — and that gap is what breaks model-migration evaluations months later. A practical schema for replayable traces.
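"What the model saw" means the post-templating messages, the tool schemas in effect, and the sampling parameters — everything needed to re-run the call. A sketch of such a record, with illustrative field names:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class LLMTrace:
    """Replayable-trace sketch: enough context to re-issue the exact call."""
    model: str
    rendered_messages: list   # what the model saw, after templating
    tool_schemas: list        # tool definitions in effect for this call
    params: dict              # temperature, max_tokens, etc.
    response: str             # what the model said
    ts: float

def log_call(model, messages, tools, params, response) -> str:
    """Serialize one call; in practice this line is appended to a trace store."""
    trace = LLMTrace(model, messages, tools, params, response, time.time())
    return json.dumps(asdict(trace))
```

A model-migration eval then replays `rendered_messages` against the new model instead of guessing what the old prompt template produced.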
Agent prompts and agent tools look like the same asset on disk, but they fail in completely different ways — and shipping them through one pipeline is the architectural mistake at the root of most agent incidents.
Temperature is not a global knob. Allocate randomness per surface — routing, parsing, synthesis, generation, exploration — or pay for variance you do not need.
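Per-surface allocation can be as simple as a config map with a deterministic default. The surface names and temperature values below are illustrative, not prescriptions:

```python
# Hypothetical per-surface sampling budget: deterministic where
# structure matters, randomness only where diversity pays.
SAMPLING = {
    "routing":     {"temperature": 0.0},  # pick-one decisions: no variance
    "parsing":     {"temperature": 0.0},  # structured extraction: no variance
    "synthesis":   {"temperature": 0.3},  # mild variety in phrasing
    "generation":  {"temperature": 0.7},  # creative drafting
    "exploration": {"temperature": 1.0},  # brainstorming, best-of-n sampling
}

def params_for(surface: str) -> dict:
    """Fail closed: an unknown surface gets deterministic sampling."""
    return SAMPLING.get(surface, {"temperature": 0.0})
```

The failure mode this prevents is one global `temperature=0.7` leaking into a router or parser that should never vary.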
Engineering orgs are quietly accumulating a second audience for their internal docs — every developer's AI assistant. The team that writes for only one of those readers is shipping broken docs to the other.