Disconnected Agent Mode: Designing for the Network You Don't Have

Tian Pan · Software Engineer · 11 min read

A flight attendant asks you to switch to airplane mode. The customer-support agent your team shipped last quarter is mid-conversation in a tab, and the next user turn returns a spinner that never resolves. The agent isn't broken in any interesting way. It just assumed, in a hundred unwritten places, that the network exists.

That assumption is the most expensive line of code your product team never wrote down. It governs how you store conversation state, how you call tools, how you surface errors, what you eval against, and what your users do when the connection drops in the middle of work that mattered to them. Disconnected agent mode is the discipline of pulling that assumption out of the foundation, looking at it, and deciding — explicitly — what should happen when the round trip to a hosted API isn't available.

The discourse around AI agents has been built on a frontier-model frame: the strongest possible model, hosted in someone else's data center, reachable over HTTPS. That frame is fine for the demo and for the marketing site. It breaks the moment your product meets the actual physics of where users live. Laptops on planes. Field technicians in basements with concrete walls. Retail kiosks behind a Wi-Fi router that reboots itself nightly. Hospital tablets on a guest network with deep packet inspection that randomly drops streaming connections. A car driving through a tunnel. Each of these is a reasonable place to use an AI feature, and each of these is a place where "round trip to a hosted API for every step" is not a working assumption.

The build-time assumption your team never wrote down

If you ask the people who built your agent "what does the system do when the network goes away," you'll usually get a long pause and then an answer that sounds like a guess. The pause is the tell. The disconnected case wasn't designed; it's whatever the framework happened to do when a fetch failed. That is almost never what you want.

Here is what you usually find when you go look: the conversation history lives only on the server, so a disconnected refresh wipes it. The next user turn calls a hosted model, gets a network error, and renders a generic "something went wrong" toast that gives the user no actionable signal. Tool calls — the ones that draft a message, file a ticket, schedule a callback — are issued synchronously inside the agent loop, so when the network is gone, the agent simply can't take action and there's no queue waiting to flush. State that the user thought they had committed (a saved draft, a pinned answer, a configured plan) lives in volatile in-memory state and disappears on reconnect. The "loading" UI is the only feedback the user gets, and it's indistinguishable between "the model is thinking" and "the network has been gone for forty seconds."

None of this is a bug in the framework. It's the result of nobody making a design decision. The disconnected behavior is whatever fell out of the build-time assumption that the network is always there.

Disconnected-first architecture, in four moves

Treating disconnect as a first-class state, not an error condition, requires four architectural shifts. Each of them changes a default that most agent stacks ship with.

A local model on the hot path, with cloud-augment when reachable. The 1B–3B parameter class of small language models has crossed a usability threshold for routine agent tasks. Models like Phi-3 Mini, Gemma 3n 2B, and Llama 3.2 1B/3B run on commodity hardware in 1–2 GB of memory and decode at tens of tokens per second on M-series Apple silicon and modern mobile NPUs. They are not GPT-class for hard reasoning, and they don't need to be. They need to be good enough to handle the answer-from-context, summarize-this-thread, draft-a-reply traffic that makes up most of an agent's workload, with the cloud model reserved for the long-tail queries that genuinely require it. The architectural inversion is the important part: the local model is the default path; the hosted model is an enhancement layered on top when the connection allows. Most teams have this exactly backwards.
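The inversion above can be sketched in a few lines. This is a minimal illustration, not a real runtime: `local_generate` and `cloud_generate` are hypothetical stand-ins for whatever local inference engine and hosted API the product actually uses, and the cloud call here simply simulates an unreachable network.

```python
def local_generate(prompt: str) -> str:
    # Stand-in for an on-device 1B-3B model (llama.cpp, MLX, an NPU runtime, ...).
    return f"[local-3b] {prompt[:40]}"

def cloud_generate(prompt: str) -> str:
    # Stand-in for the hosted frontier model; here it simulates airplane mode.
    raise ConnectionError("network unreachable")

def answer(prompt: str, needs_deep_reasoning: bool = False) -> dict:
    """Local model is the default path; the cloud is an augmentation that
    may fail without taking the feature down."""
    result = {"text": local_generate(prompt), "tier": "local"}
    if needs_deep_reasoning:
        try:
            result = {"text": cloud_generate(prompt), "tier": "cloud"}
        except (ConnectionError, TimeoutError):
            # Cloud unreachable: keep the local answer, surface the downshift.
            result["degraded"] = True
    return result
```

Note the shape of the fallback: the local answer is computed first and kept, so a cloud failure degrades the response rather than producing the spinner that never resolves.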

Queued tool calls with deferred side effects. When the agent decides to file a ticket, send a message, or update a record, the call should land in a durable outbox before it touches the network. The outbox is a write-ahead log: an append-only record of intended side effects, each tagged with a stable client-generated identifier so retries are idempotent on the server. When connectivity returns, a sync worker drains the queue with backoff, deduplication, and explicit conflict resolution for entries the server has seen out of order. The user gets feedback the moment the intent is captured, not when the side effect lands. This is well-trodden territory in offline-first mobile development; the AI agent stack is just rediscovering it ten years late.
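A toy version of that outbox, with the three properties the paragraph names: client-generated identifiers for idempotent retries, deduplication on drain, and ordered flushing that stops cleanly if the network drops again. The in-memory list stands in for a durable store; a real implementation would persist entries (SQLite is the usual choice) before acknowledging the user.

```python
import uuid

class Outbox:
    """Append-only log of intended side effects; a sketch of the pattern,
    not a production sync engine."""
    def __init__(self):
        self.entries = []   # stand-in for a durable, persisted log
        self.acked = set()  # ids the server has confirmed

    def enqueue(self, action: str, payload: dict) -> str:
        op_id = str(uuid.uuid4())  # stable client id => retries are idempotent
        self.entries.append({"id": op_id, "action": action, "payload": payload})
        return op_id  # caller shows "queued" feedback immediately

    def drain(self, send) -> int:
        """Flush pending entries when connectivity returns; `send` is the
        network call and may raise."""
        sent = 0
        for entry in self.entries:
            if entry["id"] in self.acked:
                continue  # server already saw this one; skip the retry
            try:
                send(entry)
            except ConnectionError:
                break  # still offline: preserve order, retry on next drain
            self.acked.add(entry["id"])
            sent += 1
        return sent
```

A real sync worker adds exponential backoff between drain attempts and server-side conflict resolution; the skeleton above is only the client half of the contract.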

Capability-degradation tiers the user can see. The worst version of "no network" UX is a single boolean: online or broken. Real disconnect is a spectrum. Sometimes the local model can answer. Sometimes the local model can answer but the tool-call backend is unreachable. Sometimes a critical document is in a remote vector store that hasn't been mirrored locally. Sometimes the connection is up but the latency is unusable. Each of these is a different capability tier, and the UI should expose which one the user is currently in — explicitly, with a small visible badge or status line. Users will tolerate a degraded mode if they understand they're in one. They will not tolerate an agent that silently lies to them about what it can do.
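The tier spectrum can be made concrete as an enum plus a fold over health probes. The probe inputs here are hypothetical booleans; in a real product they would come from lightweight health checks with short timeouts, re-evaluated as conditions change.

```python
from enum import Enum

class Tier(Enum):
    FULL = "cloud model + tools"
    LOCAL_WITH_TOOLS = "local model, tool backend reachable"
    LOCAL_ONLY = "local model, tool calls queued"
    UNAVAILABLE = "no model available"

def current_tier(cloud_ok: bool, tools_ok: bool, local_loaded: bool) -> Tier:
    # Fold individual probe results into the one tier the user sees.
    if cloud_ok and tools_ok:
        return Tier.FULL
    if local_loaded and tools_ok:
        return Tier.LOCAL_WITH_TOOLS
    if local_loaded:
        return Tier.LOCAL_ONLY
    return Tier.UNAVAILABLE

def badge(tier: Tier) -> str:
    # The visible status line: empty at full capability, explicit otherwise.
    return "" if tier is Tier.FULL else f"Limited mode: {tier.value}"
```

The point of the enum is that the UI, the router, and the eval harness all consume the same tier value, so "which mode was the user in" is never a matter of reconstruction after the fact.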

Eventual-consistency semantics for memory and state. If the agent has memory — pinned facts, user preferences, prior context — and that memory can be modified from multiple places (a phone, a laptop, the web), then you have a distributed-systems problem with the same shape as any multi-device sync product. CRDTs and event-sourced state, where the conversation is an append-only log of prompts, tool calls, and outcomes, give you a deterministic merge story. The alternative — last-write-wins on a single mutable record — works fine until two devices come back online and one of them silently overwrites the other's work. SQLite-based local stores with CRDT sync layers are now mature enough that this isn't a research project; it's a library choice.
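The merge semantics of an event-sourced log can be shown in miniature. Assuming each device tags events with a logical clock and its own device id, sorting the union by that pair gives every replica the same merged history, and neither device's write is silently dropped. A CRDT library would handle this (and true causal ordering) for you; this sketch shows only why append-only beats last-write-wins.

```python
def merge_logs(*device_logs):
    """Deterministic merge of per-device append-only event logs.
    Events are (logical_clock, device_id, event_text) tuples; the
    (clock, device) pair uniquely identifies an event for dedup."""
    seen = set()
    merged = []
    for event in sorted(e for log in device_logs for e in log):
        key = (event[0], event[1])
        if key not in seen:
            seen.add(key)
            merged.append(event)
    return merged

# Two replicas that diverged while offline (example data):
phone  = [(1, "phone", "pin: gate code is 4412"), (3, "phone", "pref: terse replies")]
laptop = [(1, "phone", "pin: gate code is 4412"), (2, "laptop", "note: ticket #88 filed")]
history = merge_logs(phone, laptop)
```

Under last-write-wins on a single mutable record, one of those two offline edits would have vanished on reconnect; under the merged log, both survive in a deterministic order on every device.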

Airplane mode as a first-class test environment

Most agent eval suites assume the network exists, and they run on an environment that has it. The result is that the disconnect path is the largest untested surface in the product. Teams discover its bugs the way users do: in the field, at the worst possible moment.

Treating airplane mode as a first-class test environment means a few specific things. The eval harness should include scenarios where the network is severed mid-turn, severed mid-tool-call, severed during streaming, and severed during a long-running background sync. It should include scenarios where latency is high but not infinite — the long tail of "technically online but functionally useless" connections that mobile users live in for a meaningful fraction of their day. It should include scenarios where the network returns and a backlog of queued operations needs to drain in order, with the eval grading both functional correctness and the observable user experience during the drain.
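Those scenarios are cheap to enumerate once the harness has a network test double that severs the connection at a chosen phase. A minimal sketch, with a toy agent turn standing in for the real loop; the grade is the user-visible outcome string, not whether an exception was raised internally.

```python
PHASES = ["mid_turn", "mid_tool_call", "mid_stream", "background_sync"]

class FlakyNetwork:
    """Test double: raises at exactly one chosen phase, so each disconnect
    scenario is deterministic and repeatable in CI."""
    def __init__(self, sever_at: str):
        self.sever_at = sever_at

    def call(self, phase: str) -> str:
        if phase == self.sever_at:
            raise ConnectionError(f"severed during {phase}")
        return "ok"

def run_scenario(sever_at: str) -> str:
    """Toy agent turn: three network touches. Returns what the user would
    see, which is what the eval should assert on."""
    net = FlakyNetwork(sever_at)
    try:
        net.call("mid_turn")
        net.call("mid_tool_call")
        net.call("mid_stream")
        return "completed"
    except ConnectionError as exc:
        return f"degraded gracefully ({exc})"
```

In a real harness the `except` branch would be the interesting part: did the outbox capture the intent, did the tier badge update, did the stream render a partial answer rather than a spinner. High-latency-but-connected scenarios slot in the same way, with a delay instead of an exception.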

Chaos engineering for AI agents is a young field, but the playbook from the broader infrastructure world transfers cleanly: inject failures continuously, in CI, with assertions on the user-visible outcome, not just the internal exception. The injection points are different — failed tool calls, model timeouts, streaming disconnects, partial responses — but the discipline is the same. An agent that has only ever been tested under good network conditions has not been tested.

There's a particular failure mode worth calling out: the eval that runs a disconnected scenario, observes that the agent returns some answer, and grades it as a pass. The local model produced output, so the test was green. But the answer was wrong because the agent silently fell back to a 1B model on a question that needed the 70B model, and nothing in the test harness checked for the model-tier downshift. Eval scoring needs to distinguish "answered correctly under the available capabilities" from "answered at all," and capability-degradation tiers need to flow into the grading rubric.
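Folding the tier into the rubric is a one-line change to the grader. A sketch, with hypothetical tier labels: the answer record carries the tier the agent actually used, and a silent downshift fails the eval even when text came back.

```python
def grade(answer: dict, expected_tier: str, correct: bool) -> str:
    """Distinguish 'answered correctly under the available capabilities'
    from 'answered at all'. `expected_tier` is what the routing policy
    says this question required given the network conditions."""
    if not correct:
        return "fail: wrong answer"
    if answer["tier"] != expected_tier:
        # Output was produced, but by the wrong capability tier.
        return f"fail: tier downshift ({expected_tier} -> {answer['tier']})"
    return "pass"

# The failure mode described above: a local 1B answered a question
# the policy had routed to the large cloud model.
verdict = grade({"text": "...", "tier": "local-1b"},
                expected_tier="cloud-70b", correct=True)
```

Note that `expected_tier` is conditional on the scenario: in a deliberately severed eval, the local tier is the expected one, and answering from it correctly is a pass, not a downgrade.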

Where the seam between local and cloud actually lives

The hardest design question in disconnected agent architecture isn't whether to have a local model. It's where the routing decision lives and who can see the seam. Three patterns have emerged.

The first is silent fallback: the agent runs in the cloud when it can and on-device when it can't, and the user never knows which. This is simplest to ship and hardest to operate. When quality regresses for an individual user, you have no idea whether they're getting served by the local or the cloud path. Debugging is guesswork.

The second is explicit routing with a visible indicator: the user sees a small badge that tells them which mode they're in, and the eval suite tracks per-mode quality separately. This is more honest and more debuggable. It also turns out to be more trusted by users, who tend to forgive a degraded mode they understand and resent a single mode that mysteriously underperforms.

The third is user-controlled mode: the user explicitly selects "fast local" vs. "best cloud" the way they currently select reasoning effort or model size. This works well for power users and badly for everyone else, and it punts the routing problem to the user instead of solving it. Most products will end up with the second pattern — explicit routing surfaced as a status, not a setting — because it sits in the right place between operability and user burden.

The seam between local and cloud is also where your architecture decides whether the agent is one product or two. If the local agent and the cloud agent share the same conversation log, the same memory schema, the same tool registry, and the same eval suite, they are one product with two backends. If they diverge — different tool sets, different memory, different prompts — you've shipped two agents and the user is the integration layer between them. The latter is where teams end up by default and is consistently the source of the worst bugs at the seam.

What this changes about the roadmap

Disconnected agent mode is not a feature you bolt on after the cloud version is done. It is an architectural commitment that has to land before you ship the data model, because retrofitting it requires changing how state is stored, how tools are invoked, how UI feedback is structured, and how you eval. A team that ships the cloud-only version first and "adds offline support later" almost always discovers that "later" means "rewrite the agent."

The product implication is that the question "should this product support offline?" is not a feature checkbox; it's a stance on what the product is. A consumer chat app on a phone has to assume disconnect. A developer tool used on flights has to assume disconnect. A field service app installed on rugged tablets has to assume disconnect. A retail kiosk with consumer-grade Wi-Fi has to assume disconnect. The set of products where the network is reliably present and high-bandwidth is smaller than the architecture decisions in most agent stacks would suggest.

The build assumption that the network exists was reasonable when AI products lived inside the browser tab on a desk in an office. It is not reasonable for the surface area where AI agents are heading: phones, laptops, kiosks, vehicles, devices, every place users actually are. The teams that write that assumption down, look at it, and decide what disconnected behavior should be will ship products that work where their users live. The teams that don't will ship products that work in the demo and break in the wild.

Pick a frame: the network is present unless proven absent, or the network is absent unless proven present. The first frame produced the agent stacks of 2024. The second frame is what 2026 is asking for.
