Dead Reckoning for Long-Running Agents: Knowing Where Your Agent Is Without Stopping It
Before GPS, sailors used dead reckoning: take your last confirmed position, note your speed and heading, and project forward. It works until the accumulated error compounds into something irreversible—a reef you didn't see coming.
Long-running AI agents have exactly this problem. When an agent spends two hours orchestrating API calls, writing documents, and executing multi-step plans, the people running it often have no better visibility than a sailor without instruments. The agent either finishes or it doesn't. The failure mode isn't the crash—it's the silent loop that burns $30 in tokens while appearing to work, or the agent that "successfully" completes the wrong task because its world model drifted an hour into execution.
Production data makes this concrete: agents with undetected loops have been documented repeating the same tool call 58 times before manual intervention. A two-hour runaway at frontier model rates costs $15–40 before anyone notices. And the worst failures aren't the ones that error out—they're the 12–18% of "successful" runs that return plausible-looking wrong answers.
The engineering response most teams reach for is logging. This is the wrong tool. Unstructured logs tell you what happened; they don't tell you where the agent is, how far it has to go, or whether the path it's on leads anywhere productive. You need instruments, not a black box recorder.
Why Agent Progress Is Hard to Measure
The difficulty isn't technical—it's conceptual. Traditional software runs in deterministic steps you can count. A task with 100 database operations is 47% done after 47 operations. Agents don't work like this.
First, the task scope is often unknown at the start. "Research this topic and write a report" might take 20 tool calls or 200, depending on what the agent finds. Progress percentages are meaningless if you don't know the denominator.
Second, agents can be productive without making visible forward motion. An agent spending five minutes synthesizing research across retrieved documents is doing valuable work, but from the outside it looks identical to an agent stuck in a loop. The distinction requires understanding the content of what's happening, not just that something is happening.
Third, agents reconsider. An agent that reaches step 30 of a planned 40-step task and then backtracks to revise its approach isn't broken—it might be doing the right thing. But to an observer watching step count, it looks like regression.
METR's research on long-horizon task completion shows that for any given time horizon, frontier models complete roughly 50% of tasks at that human-time length but approximately 80% of tasks half as long. The implication: doubling task duration sharply cuts success rates, and the variance in agent behavior grows with it. The longer the task, the more you need instruments that can distinguish productive struggle from terminal drift.
The Four Instruments You Actually Need
1. Structured Milestones, Not Log Lines
The first instrument is replacing unstructured log output with structured milestone events. Every meaningful state transition the agent makes—starting a research phase, completing a synthesis step, deciding to revise the plan—should emit a machine-readable event that includes the agent's current goal, what it just decided, and what it believes comes next.
This isn't about verbose logging. It's about designing agents that externalize their internal state at decision boundaries. The event format matters because it's what lets you build dashboards, set alerts, and calculate progress metrics downstream.
An agent that knows it has completed 7 of an estimated 23 research subtasks is infinitely more observable than one that has "been running for 40 minutes." The milestone structure forces the agent to reason explicitly about its own task decomposition, which as a side effect often improves plan quality—the agent can't emit a milestone without having a plan to emit milestones against.
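As a sketch of what such an event could look like in practice (the `Milestone` shape and field names here are illustrative assumptions, not a standard):

```python
import io
import json
import sys
import time
from dataclasses import dataclass, asdict

@dataclass
class Milestone:
    """One structured event emitted at a decision boundary."""
    goal: str             # the agent's current high-level goal
    decision: str         # what the agent just decided
    next_step: str        # what the agent believes comes next
    completed: int        # subtasks finished so far
    estimated_total: int  # agent's current estimate of total subtasks
    ts: float = 0.0

def emit_milestone(m: Milestone, sink) -> None:
    """Serialize the milestone as one JSON line so dashboards and
    alerting can consume it downstream without log scraping."""
    m.ts = time.time()
    sink.write(json.dumps(asdict(m)) + "\n")

# Example: the agent reports finishing subtask 7 of an estimated 23.
emit_milestone(
    Milestone(
        goal="write market research report",
        decision="completed competitor pricing summary",
        next_step="synthesize pricing data into section 2",
        completed=7,
        estimated_total=23,
    ),
    sys.stdout,
)
```

Because each event is a single machine-readable record, progress metrics like completed/estimated_total fall out of a query rather than a grep.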
2. Velocity Tracking
The second instrument is tracking execution velocity: how much work is the agent completing per unit of time, smoothed over recent steps?
Velocity dropping toward zero is the clearest signal of stuck-agent syndrome. It catches two distinct failure modes:
The hard loop: The agent calls the same tool with the same inputs and gets the same outputs. Velocity doesn't drop here—it can actually look artificially high—but if you fingerprint recent (tool-name, result-digest) pairs and find 3+ identical entries in a window, you have a loop. The fingerprint approach is effective because it catches content-level repetition rather than just temporal patterns.
The soft stall: The agent keeps executing, keeps calling tools, but each step produces diminishing new information. Results are novel but don't move the task forward. Velocity metrics that track new-information rate (how much of each tool output represents state the agent didn't previously have) distinguish productive tool use from expensive noise.
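A crude lexical proxy for new-information rate can be computed with nothing more than token sets; this hypothetical `novelty_rate` helper treats unseen tokens as new information (embedding-based similarity would be more robust, but this is often enough for a first alert):

```python
def novelty_rate(output: str, seen: set) -> float:
    """Fraction of an output's tokens the agent hasn't seen before.
    Mutates `seen` so successive calls track cumulative knowledge."""
    tokens = set(output.lower().split())
    if not tokens:
        return 0.0
    new = tokens - seen
    seen |= tokens
    return len(new) / len(tokens)
```

An agent whose tool outputs trend toward zero novelty is producing expensive noise even though every call "succeeds."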
Concrete thresholds: if velocity drops below 20% of baseline for five consecutive steps with no plan revision, escalate. If fingerprinted loops appear three times in a ten-step window, escalate immediately.
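Both checks can be sketched in a small monitor with the thresholds above as defaults (the `VelocityMonitor` class and its work-unit inputs are illustrative assumptions; tune per workload):

```python
import hashlib
from collections import Counter, deque

class VelocityMonitor:
    """Detects hard loops (identical tool calls) and soft stalls
    (velocity collapse) over a sliding window of recent steps."""

    def __init__(self, window=10, loop_threshold=3,
                 stall_ratio=0.2, stall_steps=5):
        self.fingerprints = deque(maxlen=window)
        self.velocities = []  # work units per step, most recent last
        self.loop_threshold = loop_threshold
        self.stall_ratio = stall_ratio
        self.stall_steps = stall_steps

    def record_step(self, tool_name: str, result: str, work_units: float):
        # Fingerprint on content, not timing: (tool, result-digest)
        digest = hashlib.sha256(result.encode()).hexdigest()[:16]
        self.fingerprints.append((tool_name, digest))
        self.velocities.append(work_units)

    def hard_loop(self) -> bool:
        """3+ identical (tool, result-digest) pairs in the window."""
        counts = Counter(self.fingerprints)
        return any(n >= self.loop_threshold for n in counts.values())

    def soft_stall(self) -> bool:
        """Velocity below 20% of baseline for 5 consecutive steps."""
        if len(self.velocities) < self.stall_steps + 1:
            return False
        baseline = sum(self.velocities) / len(self.velocities)
        recent = self.velocities[-self.stall_steps:]
        return all(v < self.stall_ratio * baseline for v in recent)
```

The hard-loop check never looks at velocity at all, which is the point: it catches the repetition that velocity metrics miss.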
3. Confidence Drift Detection
The third instrument tracks the agent's own certainty at each decision point.
Most agent architectures already have confidence signals, even if they're not surfaced: the spread of attention over possible next actions, the length of reasoning chains the agent generates before committing to a step, the frequency of hedging language in agent outputs. Formalizing these into a numeric confidence estimate—and tracking how it changes over time—gives you an early warning system for directional problems.
Healthy agents typically show confidence between 0.85 and 0.95 during normal operation. Two patterns indicate trouble:
Gradual decay: Confidence drifts from 0.9 to 0.7 to 0.5 over the course of execution. This usually means the agent has encountered information that conflicts with its initial plan and is improvising without acknowledging that it's improvising. The world model has drifted from reality but the agent hasn't detected it.
Confidence collapse with reasoning expansion: The agent's stated confidence drops sharply while the number of steps it takes to reach each decision increases. This correlation—less certain but working harder—is the signature of an agent that has lost its frame and is searching without a map.
The trigger isn't any single confidence reading; it's the trajectory. An agent that starts at 0.9, drops to 0.75 after encountering an unexpected API response, and then recovers to 0.85 is behaving correctly. An agent that drops from 0.9 to 0.6 over 15 steps without recovery needs human attention.
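One way to encode that trajectory rule, assuming confidence readings are already collected per decision point (the thresholds here are illustrative, not calibrated):

```python
def confidence_drift_alert(history, drop=0.25, window=5):
    """Fires on trajectory, not on any single reading: every one of
    the last `window` confidence values sits well below the early-run
    baseline, i.e. a sustained drop with no recovery."""
    if len(history) < window + 3:
        return False                  # not enough history to judge
    baseline = max(history[:3])       # early-run confidence level
    recent = history[-window:]
    return all(baseline - c >= drop for c in recent)
```

The healthy dip-and-recover pattern never fires this alert, because the recovered readings re-enter the trailing window and break the "all below baseline" condition.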
4. Budget-Linked Circuit Breakers
The fourth instrument is the only one most teams have already built: token spend limits. It's necessary but insufficient on its own.
The failure mode of pure token limits is that they're binary and late-arriving. The circuit breaker trips at the limit—not when the agent starts going wrong, but two hours later. And it provides no information about what went wrong or where.
Effective budget management combines hard limits with progressive thresholds. At 50% of the token budget, log a structured budget-midpoint event with current velocity and estimated completion. At 75%, require the agent to produce an explicit completion estimate—not to stop it, but to force it to reason about its own remaining work. At 90%, require human confirmation to continue. Hard stop at the limit.
This turns a binary kill switch into a graduated alert system that gives you visibility into budget trajectory before you hit the wall.
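The graduated thresholds could be sketched as a simple gate evaluated before each step; `confirm_fn` here stands in for whatever human-confirmation channel the host application provides (an assumption, not a standard API):

```python
def budget_gate(tokens_used: int, budget: int, confirm_fn) -> str:
    """Graduated budget circuit breaker: progressive thresholds
    instead of a single late-arriving kill switch."""
    frac = tokens_used / budget
    if frac >= 1.0:
        return "hard_stop"
    if frac >= 0.9:
        # Require human confirmation to continue past 90%
        return "continue" if confirm_fn() else "hard_stop"
    if frac >= 0.75:
        # Force the agent to reason about its own remaining work
        return "require_completion_estimate"
    if frac >= 0.5:
        # Emit a structured budget-midpoint event
        return "log_midpoint"
    return "continue"
```

Each returned action maps to a behavior in the agent loop, so the budget trajectory is visible long before the wall.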
Mid-Flight Steering: The Intervention Lever
Instruments are only useful if they feed into the ability to intervene. The worst outcome is detecting that an agent is drifting two hours in and having no option except to kill it and restart from scratch.
Mid-flight steering requires two architectural preconditions. First, the agent's execution state must be durably persisted after each meaningful step—not just in memory, but in a persistence layer that survives process restart. This is what LangGraph calls durable checkpointing, and it's the single biggest architectural decision that separates agents you can steer from agents you can only restart.
Second, the agent must be able to accept external input mid-execution without treating it as a restart. This is harder than it sounds. Most agent frameworks treat the initial prompt as immutable context and subsequent user messages as continuation within a fixed conversation. Steering requires the agent to update its plan in response to new guidance without losing its accumulated state—all the research it's done, all the decisions it's made, all the context it's built up over the previous hour of work.
When both preconditions are met, the intervention can be a nudge: "You're heading toward the wrong subtask, skip C and go directly to D." This is categorically cheaper than the alternative. Practitioners who've implemented steering report that a brief mid-run correction eliminates the need for three to four complete restarts, a roughly 3–4x cost reduction for the sessions that need it.
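A toy sketch of the second precondition, folding guidance into the plan without discarding accumulated state (the state shape and the "skip X" guidance grammar are simplified assumptions for illustration):

```python
def apply_steering(state: dict, guidance: str) -> dict:
    """Fold external guidance into the existing plan without a
    restart: accumulated context passes through untouched; only the
    pending plan changes."""
    state = dict(state)  # don't mutate the caller's checkpointed copy
    state["plan_revisions"] = state.get("plan_revisions", []) + [guidance]
    if guidance.startswith("skip "):
        # e.g. "skip C, go directly to D" prunes subtask C
        skipped = guidance.split()[1].rstrip(",")
        state["pending"] = [s for s in state.get("pending", [])
                            if s != skipped]
    return state
```

The important property is what the function does not touch: research results, prior decisions, and built-up context all survive the correction.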
OpenTelemetry as the Portability Layer
The four instruments above generate data. The question is where that data goes and how you query it.
The emerging answer for teams that want portability is OpenTelemetry with the GenAI semantic conventions (v0.4.0+). These conventions define a standard span format for LLM calls, tool calls, and agent operations, with well-defined attribute names for token counts, model identifiers, tool names, and finish reasons; agent-specific signals like goal progress and loop counts can ride alongside as custom attributes on the same spans.
The benefit isn't elegance—it's that data structured to the semantic conventions flows into any observability backend that speaks OpenTelemetry: Datadog, Honeycomb, Grafana, New Relic. You instrument once, and your dashboards work wherever your telemetry goes. For teams that haven't settled on a monitoring stack, this matters. For teams that have, it means the agent observability data lives in the same place as the rest of your system observability, which is where you want it when you're debugging a production incident.
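As a rough sketch, the attributes a tool-call span might carry under these conventions can be modeled as a plain dict; the `gen_ai.*` names below follow the conventions at the time of writing and should be verified against the current semconv revision, and the goal-progress field is an application-defined custom attribute, not part of the core spec:

```python
def tool_call_span_attributes(tool_name: str, model: str,
                              input_tokens: int, output_tokens: int,
                              goal_progress: float) -> dict:
    """Build the attribute dict a tool-call span would carry.
    Attribute names follow the GenAI semantic conventions; the
    app.* entry is a custom, application-defined addition."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "app.agent.goal_progress": goal_progress,  # custom, not semconv
    }
```

Attached to spans via any OpenTelemetry SDK, these attributes are what make the same dashboard query work across Datadog, Honeycomb, Grafana, or New Relic.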
When Frameworks Help (and When They Don't)
LangGraph has the best production story for long-running agent observability of any current framework: durable state persistence, structured step execution, LangSmith integration for trace visualization, and native support for checkpoint-based resumption. If you're building long-running agents in Python and have the flexibility to choose your framework, it solves a significant fraction of the instrumentation problem.
CrewAI and AutoGen provide weaker guarantees. CrewAI has task limits and basic fallback handling but no native checkpoint system—agent state is lost if the process dies. AutoGen relies on in-memory state management, which is a significant limitation for tasks that run beyond a few minutes.
The practical implication: if you're using a framework that doesn't provide durable checkpointing, you need to build it yourself at the application layer. This is non-trivial but not complicated—it means writing agent state (goals, completed steps, accumulated context) to a persistent store after each significant action, and building a resumption path that can reconstruct the agent's context from that stored state.
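A minimal sketch of that application-layer pattern using SQLite as the persistent store (the schema and state shape are illustrative assumptions):

```python
import json
import sqlite3

def init_store(path=":memory:"):
    """Open the checkpoint store; use a file path in production so
    state survives process restart."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS checkpoints "
               "(run_id TEXT, step INTEGER, state TEXT, "
               "PRIMARY KEY (run_id, step))")
    return db

def save_checkpoint(db, run_id: str, step: int, state: dict) -> None:
    """Write goals, completed steps, and accumulated context after
    each significant action."""
    db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
               (run_id, step, json.dumps(state)))
    db.commit()

def resume(db, run_id: str):
    """Reconstruct the agent's context from the latest stored state,
    or None if this run has no checkpoints."""
    row = db.execute("SELECT state FROM checkpoints WHERE run_id = ? "
                     "ORDER BY step DESC LIMIT 1", (run_id,)).fetchone()
    return json.loads(row[0]) if row else None
```

The resumption path is the part teams skip and regret: writing state is easy, but proving you can rebuild a working agent context from it is what makes the checkpoint worth anything.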
The Architectural Commitment
Dead reckoning in navigation works because sailors committed to the discipline: measuring speed with a chip log, marking time, recording heading. It required instrumentation built into the voyage, not added as an afterthought.
Agent observability has the same character. The patterns described here—structured milestones, velocity tracking, confidence drift detection, budget-linked circuit breakers, durable checkpointing—only work if they're designed into the agent architecture from the start, not bolted on after the agent is already running in production.
The cost of retrofitting is high. An agent built without structured state is hard to checkpoint. An agent built without planned milestones has no natural places to emit progress events. An agent that treats its own confidence as private internal state has nothing to expose to a confidence drift monitor.
The reward for building these in is proportionate. Agent tasks that fail silently after two hours become tasks that fail noisily after twenty minutes with clear diagnostic data. Runaway loops that burn $40 before anyone notices become loops that trigger an alert after three repetitions. And agents that drift from their plan become agents that surface the drift in real time, giving you a steering window before the error compounds into something irreversible.
Long-running agents will become common faster than most teams expect. The engineering discipline for running them reliably is available now—it just requires treating observability as a first-class architectural concern rather than something to add once the agent is "working."
- https://opentelemetry.io/blog/2025/ai-agent-observability/
- https://langfuse.com/blog/2024-07-ai-agent-observability-with-langfuse
- https://galileo.ai/blog/multi-agent-ai-failures-prevention
- https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
- https://arxiv.org/html/2511.08325
- https://eunomia.dev/blog/2025/05/11/checkpointrestore-systems-evolution-techniques-and-applications-in-ai-agents/
- https://ably.com/blog/ai-transport-redirect-steering
- https://blog.langchain.com/building-langgraph/
- https://markaicode.com/langgraph-production-agent/
- https://www.fixbrokenaiapps.com/blog/ai-agents-infinite-loops
- https://galileo.ai/blog/ai-agent-cost-optimization-observability
- https://sparkco.ai/blog/mastering-confidence-scoring-in-ai-agents/
- https://atlan.com/know/ai-agent-observability/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
