Dead Reckoning for Long-Running Agents: Knowing Where Your Agent Is Without Stopping It
Before GPS, sailors used dead reckoning: take your last confirmed position, note your speed and heading, and project forward. It works until the accumulated error compounds into something irreversible—a reef you didn't see coming.
Long-running AI agents have exactly this problem. When an agent spends two hours orchestrating API calls, writing documents, and executing multi-step plans, the people running it often have no better visibility than a sailor without instruments. The agent either finishes or it doesn't. The failure mode isn't the crash—it's the silent loop that burns $30 in tokens while appearing to work, or the agent that "successfully" completes the wrong task because its world model drifted an hour into execution.
Production data makes this concrete: agents with undetected loops have been documented repeating the same tool call 58 times before manual intervention. A two-hour runaway at frontier model rates costs $15–40 before anyone notices. And the worst failures aren't the ones that error out—they're the 12–18% of "successful" runs that return plausible-looking wrong answers.
The engineering response most teams reach for is logging. This is the wrong tool. Unstructured logs tell you what happened; they don't tell you where the agent is, how far it has to go, or whether the path it's on leads anywhere productive. You need instruments, not a black box recorder.
Why Agent Progress Is Hard to Measure
The difficulty isn't technical—it's conceptual. Traditional software runs in deterministic steps you can count. A task with 100 database operations is 47% done after 47 operations. Agents don't work like this.
First, the task scope is often unknown at the start. "Research this topic and write a report" might take 20 tool calls or 200, depending on what the agent finds. Progress percentages are meaningless if you don't know the denominator.
Second, agents can be productive without making visible forward motion. An agent spending five minutes synthesizing research across retrieved documents is doing valuable work, but from the outside it looks identical to an agent stuck in a loop. The distinction requires understanding the content of what's happening, not just that something is happening.
Third, agents reconsider. An agent that reaches step 30 of a planned 40-step task and then backtracks to revise its approach isn't broken—it might be doing the right thing. But to an observer watching step count, it looks like regression.
METR's research on long-horizon task completion shows frontier models complete roughly 50% of tasks that take humans a given amount of time, and approximately 80% of tasks that take half as long. The implication: success rates fall off sharply as task duration grows, and so does the variance in agent behavior. The longer the task, the more you need instruments that can distinguish productive struggle from terminal drift.
The Four Instruments You Actually Need
1. Structured Milestones, Not Log Lines
The first instrument is replacing unstructured log output with structured milestone events. Every meaningful state transition the agent makes—starting a research phase, completing a synthesis step, deciding to revise the plan—should emit a machine-readable event that includes the agent's current goal, what it just decided, and what it believes comes next.
This isn't about verbose logging. It's about designing agents that externalize their internal state at decision boundaries. The event format matters because it's what lets you build dashboards, set alerts, and calculate progress metrics downstream.
An agent that knows it has completed 7 of an estimated 23 research subtasks is far more observable than one that has "been running for 40 minutes." The milestone structure forces the agent to reason explicitly about its own task decomposition, which often improves plan quality as a side effect: the agent can't emit a milestone without having a plan to emit milestones against.
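A minimal sketch of such an event, assuming a JSON-lines sink for downstream dashboards (the field names and the `emit` helper are illustrative, not a standard schema):

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Milestone:
    """One machine-readable state transition, emitted at a decision boundary."""
    goal: str             # the agent's current goal
    decision: str         # what it just decided
    next_step: str        # what it believes comes next
    completed: int        # subtasks finished so far
    estimated_total: int  # current estimate of the denominator (may be revised)
    ts: float = 0.0

def emit(milestone: Milestone, sink) -> None:
    """Serialize the event as one JSON line so tooling can consume it."""
    milestone.ts = time.time()
    sink.write(json.dumps(asdict(milestone)) + "\n")
```

The point of the flat, typed record is that "7 of an estimated 23" is now a queryable fact rather than a sentence buried in a log line.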
2. Velocity Tracking
The second instrument is tracking execution velocity: how much work is the agent completing per unit of time, smoothed over recent steps?
Velocity dropping toward zero is the clearest signal of stuck-agent syndrome. It catches two distinct failure modes:
The hard loop: The agent calls the same tool with the same inputs and gets the same outputs. Velocity doesn't drop here—it can actually look artificially high—but if you fingerprint recent (tool-name, result-digest) pairs and find 3+ identical entries in a window, you have a loop. The fingerprint approach is effective because it catches content-level repetition rather than just temporal patterns.
The soft stall: The agent keeps executing, keeps calling tools, but each step produces diminishing new information. Results are novel but don't move the task forward. Velocity metrics that track new-information rate (how much of each tool output represents state the agent didn't previously have) distinguish productive tool use from expensive noise.
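The hard-loop fingerprint check can be sketched in a few lines; the window size, digest choice, and class name here are assumptions, not a prescribed implementation:

```python
import hashlib
from collections import Counter, deque

class LoopDetector:
    """Flags content-level repetition: N+ identical (tool-name, result-digest)
    pairs inside a sliding window of recent steps."""
    def __init__(self, window: int = 10, threshold: int = 3):
        self.recent: deque = deque(maxlen=window)  # old entries fall off
        self.threshold = threshold

    def observe(self, tool_name: str, result: str) -> bool:
        # Digest the result so repetition is judged on content, not timing.
        digest = hashlib.sha256(result.encode()).hexdigest()[:16]
        self.recent.append((tool_name, digest))
        return Counter(self.recent)[(tool_name, digest)] >= self.threshold
```

Because the fingerprint includes the result digest, an agent retrying a tool with genuinely different outputs does not trip the detector; only identical call/result pairs do.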
A concrete threshold: if velocity drops below 20% of baseline for five consecutive steps with no plan revision, escalate. If fingerprinted loops appear three times in a ten-step window, escalate immediately.
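The velocity side of that policy might look like the sketch below, where "velocity" is whatever per-step progress measure the agent reports (e.g. a new-information rate) and the smoothing factor is an assumption:

```python
from typing import Optional

class VelocityMonitor:
    """Escalates when smoothed velocity stays below a fraction of baseline
    for `patience` consecutive steps with no plan revision."""
    def __init__(self, floor: float = 0.20, patience: int = 5, alpha: float = 0.3):
        self.floor = floor            # fraction of baseline considered stalled
        self.patience = patience      # consecutive slow steps before escalation
        self.alpha = alpha            # EMA smoothing factor
        self.baseline: Optional[float] = None
        self.ema: Optional[float] = None
        self.slow_steps = 0

    def observe(self, velocity: float, plan_revised: bool = False) -> bool:
        # Exponential moving average smooths single-step noise.
        self.ema = velocity if self.ema is None else (
            self.alpha * velocity + (1 - self.alpha) * self.ema)
        if self.baseline is None:
            self.baseline = self.ema  # first observation seeds the baseline
            return False
        if plan_revised:
            self.slow_steps = 0       # a deliberate revision resets the counter
            return False
        if self.ema < self.floor * self.baseline:
            self.slow_steps += 1
        else:
            self.slow_steps = 0
        return self.slow_steps >= self.patience
```

In practice the baseline would be re-estimated per task phase rather than seeded once, since a synthesis phase legitimately runs slower than a retrieval phase.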
3. Confidence Drift Detection
The third instrument tracks the agent's own certainty at each decision point.
Most agent architectures already have confidence signals, even if they're not surfaced: the spread of attention over possible next actions, the length of reasoning chains the agent generates before committing to a step, the frequency of hedging language in agent outputs. Formalizing these into a numeric confidence estimate—and tracking how it changes over time—gives you an early warning system for directional problems.
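However the per-decision confidence number is derived, tracking its trend is mechanical. A sketch, assuming confidence arrives as a float in [0, 1] (the window size and drift definition are assumptions):

```python
from collections import deque
from statistics import mean

class ConfidenceTracker:
    """Tracks per-decision confidence and reports drift: the recent average
    minus the earlier average over a sliding window."""
    def __init__(self, window: int = 20, healthy: tuple = (0.85, 0.95)):
        self.history: deque = deque(maxlen=window)
        self.healthy = healthy

    def observe(self, confidence: float) -> dict:
        self.history.append(confidence)
        half = len(self.history) // 2
        older = list(self.history)[:half] or [confidence]
        newer = list(self.history)[half:]
        lo, hi = self.healthy
        return {
            "drift": mean(newer) - mean(older),   # negative = eroding certainty
            "in_band": lo <= mean(newer) <= hi,   # inside the healthy range
        }
```

A sustained negative drift is the early warning: the agent is growing less sure of each step well before it emits anything that looks wrong.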
Healthy agents typically show confidence between 0.85 and 0.95 during normal operation. Two patterns indicate trouble:
Sources
- https://opentelemetry.io/blog/2025/ai-agent-observability/
- https://langfuse.com/blog/2024-07-ai-agent-observability-with-langfuse
- https://galileo.ai/blog/multi-agent-ai-failures-prevention
- https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
- https://arxiv.org/html/2511.08325
- https://eunomia.dev/blog/2025/05/11/checkpointrestore-systems-evolution-techniques-and-applications-in-ai-agents/
- https://ably.com/blog/ai-transport-redirect-steering
- https://blog.langchain.com/building-langgraph/
- https://markaicode.com/langgraph-production-agent/
- https://www.fixbrokenaiapps.com/blog/ai-agents-infinite-loops
- https://galileo.ai/blog/ai-agent-cost-optimization-observability
- https://sparkco.ai/blog/mastering-confidence-scoring-in-ai-agents/
- https://atlan.com/know/ai-agent-observability/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
