Skip to main content

Your Clock-in-Prompt Is a Correctness Boundary, Not a Log Field

· 10 min read
Tian Pan
Software Engineer

A scheduling agent booked a customer's onboarding call for Tuesday instead of Wednesday. The investigation took two days. The prompt was fine. The model was fine. The calendar tool was fine. The bug was that the system prompt carried a current_time field stamped an hour earlier, when the request routed through a cached prefix built just before midnight UTC. By the time the agent parsed "tomorrow at 10 AM" and called the booking tool, "tomorrow" referred to a day that was already "today" for the user in Tokyo.

The agent had no way to notice. It had nothing to notice with. LLMs do not have clocks. They have whatever string you handed them in the prompt, and they treat that string as authoritative the same way they treat the user's question as authoritative — which is to say, completely, without skepticism, without a second source to cross-check against.

Most teams know this in the abstract and still treat the timestamp they inject like a log field: something nice to have, rendered into the system prompt for context, nobody's explicit responsibility, nobody's correctness boundary. That framing is wrong. The timestamp is a correctness boundary. Every agent behavior that depends on "now" — scheduling, expiration, retry windows, "recently," "tomorrow," "in five minutes," freshness checks on retrieved documents — runs on whatever your time plumbing produced, and inherits every bug that plumbing has.

The four ways your clock lies to the agent

There are four distinct bugs that produce a wrong "now" inside the model, and they fail in different ways. A team that fixes one and ships often thinks they fixed the whole category.

Timezone default. Your server runs in UTC because that is the sensible server default. Someone wrote datetime.now() in the prompt-assembly code, without a timezone argument, because "the server is UTC, so UTC is fine." For a user in California at 10 PM local, UTC is already tomorrow; the agent asked for "today's schedule" now fetches the wrong calendar day. For a user in Tokyo at 7 AM, UTC is still yesterday; "did I complete my workout today" returns yesterday's workout. The bug is deterministic per time-of-day and per user location, which means it reproduces for some users and not others, which means it takes a long time to surface.

Stale cached prompt. You cache the system prompt for cost reasons. Prompt caching works by matching a prefix; any byte that changes invalidates the cache. A naive implementation rebuilds the system prompt with a fresh timestamp every request, and the cache never hits — so someone moves the timestamp into a stable location "to keep the cache warm," and now the timestamp is the cache key's prefix, which means it is frozen at whatever it was when the prefix was built. The agent now gets an authoritative-looking current_time that is minutes, hours, or in pathological cases a full day old. There is a real architectural tension here that teams resolve in opposite directions and neither direction is correct: either you break cache on every call, or you serve stale time. The right answer is that the stable prefix contains no time at all, and "now" lives in the per-turn user message where it belongs — but that requires someone to notice the tension exists.

DST transitions. Twice a year, in jurisdictions that still observe it, the local clock moves by an hour. The agent reasoning about "schedule this for tomorrow 3 PM" picks up "tomorrow" from the injected date, adds "15:00," and hands that to a calendar API that interprets 15:00 in whatever timezone the API was told to use. If the agent's clock string was built before the DST transition and the booking fires after, the hour is wrong. The same bug applies to expirations: "this link is valid for 24 hours" stamped at the wrong side of a DST shift expires an hour early or an hour late, and the user whose session died forty minutes before they expected calls support.

Context staleness in long-running agents. Modern agents are not request-response — they run for minutes, sometimes hours, through many tool calls. The system prompt was assembled at step zero. At step forty, the tool call fires. The current_time value the agent reasons with at step forty is whatever it was at step zero, unless you went out of your way to refresh it. "Retry in 5 minutes" decided at step zero and executed at step forty is now forty-five minutes in the past or forty-five minutes in the future, depending on how the agent interpreted its own instruction. MCP sessions and OAuth tokens expire inside these long runs too, for the same structural reason: "when does this become invalid" was computed against a snapshot of time that no longer applies.

Each of these is its own incident class. "We fixed the timezone bug" is a true statement that does not imply "we fixed the time bug."

Time is a fact, not a decoration

The reframe that matters: treat "now" the same way you treat any other fact the agent is expected to reason about correctly. Facts have provenance, they have a freshness window, they have explicit types, and they appear in the prompt in a location where the agent is expected to actually read them rather than glance past them.

Concretely, this means the time context should be:

  • Explicitly asserted, not buried in preamble. A line that says The current time is 2026-04-23T14:22:00-07:00 (Pacific Daylight Time). The user's local date is Thursday. is a fact the agent can cite. A vague "today is April 23, 2026" stamped in a block of boilerplate is decoration.
  • Timezone-qualified, with both the offset and the named zone. The offset alone (-07:00) does not disambiguate during DST transitions. The IANA zone name does, and it carries the future-dated transition rules with it.
  • Decomposed into the forms the agent will actually use. Agents ask "what day of the week is it" far more often than you'd think, and LLMs are notoriously bad at computing day-of-week from a bare date. Include day-of-week in the injected fact so the model does not have to do modular arithmetic it is poor at.
  • Refreshed at step boundaries, not only at the start of the conversation. Every non-trivial planner step, every tool call that depends on recency, every loop iteration — re-inject a fresh, explicit "now" as part of the step's user message. The cost is a few dozen tokens. The correctness gain is that the agent cannot drift more than one step out of sync.
  • Out of the cached prefix. The stable system prompt describes the agent's role, tools, and invariants. It does not describe time. Time belongs in the per-turn user-visible payload, where it can change without invalidating the expensive KV-cache prefix, and where the agent treats it as fresh context rather than ambient truth.

The discipline sounds fussy until you line up the incident post-mortems and notice that every one of them traces to a version of "the agent trusted a stale, under-specified time string."

The evals most teams don't run

If time is a correctness boundary, evals should probe it adversarially the way they probe any other correctness boundary. Most eval suites do not. The golden cases use "now" equals whenever the eval ran, which is exactly the case the production bug will not hit. Academic work like Test of Time has identified scheduling and multi-step temporal reasoning as categories where frontier models do worst, and production eval suites systematically avoid exercising them.

A useful adversarial clock-fixture suite looks like this:

  • Timezone mismatch: server in UTC, user in UTC+12, user in UTC-11. Run the same prompt against all three and verify the answers are consistent in the user's frame, not the server's.
  • DST boundary: the spring-forward and fall-back days in the user's zone. Both the day itself and the hour before and after. If an agent schedules across this boundary, the offset math has to survive.
  • Day-of-week rollover: the fifteen minutes before and after the user's local midnight. Ask about "today" and "tomorrow" at each sample. Compare.
  • Stale context: assemble the prompt with a timestamp that is five minutes, one hour, and one day old, and verify the agent either (a) uses fresh time from a tool call or (b) explicitly flags that its time context is stale. Silent acceptance of stale time should fail the eval.
  • Long-run drift: inject a timestamp at step zero, run forty synthetic steps, and check whether the agent is still reasoning against step-zero time at step forty, or has refreshed.

These cases are cheap to synthesize. They find bugs that never appear in the request-logs-replayed eval harness most teams are already running.

The tool-use pattern that actually helps

If time is stale, the agent needs a way to discover it is stale and refresh. The pattern that works is a small, cheap get_current_time tool — trivial to implement, returns an ISO-8601 string with zone — and a system instruction that tells the agent to call it before any tool call whose semantics depend on absolute time (booking, expiration checks, retry scheduling, recency filtering).

There are two failure modes to guard against. Agents will over-call the tool if you instruct them too loudly — every response starts with a clock lookup for no reason, wasting tokens and latency. Agents will under-call the tool if the system prompt looks like it already has a reliable clock — the injected timestamp, even a stale one, signals "you have this, don't ask." The instruction needs to be specific: call the tool when a tool call's correctness depends on time being accurate to the minute, not when the user casually mentions "today."

The deeper point is that the tool's existence is what makes time self-correcting. An agent that can always refresh its clock can survive a stale system prompt; an agent that cannot has to hope the system prompt was assembled correctly, and hope does not scale.

Who owns the clock

This is the organizational question the incidents keep surfacing. The agent team owns the prompt. The platform team owns the deployment's timezone configuration. The integrations team owns the calendar and booking APIs, each with their own idea of "now." The eval team owns the harness that tests scheduling — against whatever clock the harness happened to run under.

When an agent books a meeting on the wrong day, every team can truthfully say their component worked. The bug lives in the seams between them, which is the same as saying nobody owns it, which is the same as saying it keeps happening. Someone — usually whoever ran the post-mortem for the third Tuesday-versus-Wednesday incident — needs to write down that the clock-in-prompt is a correctness contract between the prompt-assembly code, the caching layer, the agent's tools, and the eval harness, and that one owner is responsible for the contract as a whole.

Until that role exists, the bug class persists. The agent will keep confidently scheduling against a clock that was right when somebody stamped it into a string hours ago, and the user whose calendar is wrong will keep assuming the AI is getting smarter and wondering why their Tuesday meeting is on Wednesday.

References:Let's stay in touch and Follow me for more thoughts and updates