The Agent That Read Last Week's Slack Like It Was Yesterday
Your operations agent answers a question about the upcoming launch by quoting a Slack message that says "we'll ship tomorrow." The agent treats that as a present-tense plan and starts writing comms. The message was posted six weeks ago. The ship happened. The retrieval pipeline pulled the right chunk by every metric you measure — semantic similarity to "launch date," top-1 confidence above your threshold, source channel matching the project — and the agent built a plan on a sentence that meant something only inside the meeting where it was written.
The bug is not in the model. The bug is that tomorrow is not a date. It is a pointer to a clock, and the clock the message was written against is not the clock the agent is reading it on. Your retrieval pipeline indexed the body of the message and discarded the frame.
This failure shows up everywhere agents read durable text. A meeting transcript that says "by end of week" gets fed to an agent on a Tuesday three months later, and the agent acts as if the week in question is this one. A ticket comment from Q3 promising "I'll send it Monday" surfaces in an October retrieval, and the agent waits for an email that arrived in August. A customer's support reply from April saying "call me back in an hour" drops into a callback queue in May, and the agent dials.
The retrieved-text view of the world makes every utterance equally present-tense. That is fine when the content is a fact ("the API endpoint is /v2/orders") and dangerous when the content is a deictic — a word whose meaning depends on where and when it was said.
Deixis is untrusted input when it leaves its context
In linguistics, deictic expressions are the ones whose referent only exists relative to the utterance. Here, there, now, yesterday, tomorrow, next week, in two days, the day after, the previous quarter, recently, soon, we, you. None of these mean anything on their own. They are variables, not values. The conversation around them — who is speaking, when, where, to whom — is the binding environment.
A retrieval pipeline that pulls a sentence out of that environment and hands it to a model is, in effect, evaluating an expression with unbound variables and pretending the result is a constant. The model has no native way to recover the binding, because the binding was never in the sentence to begin with. It has to either invent a binding (typically: the agent's own now) or refuse to reason about the sentence. In practice the model invents one, silently, every time.
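To make the unbound-variable framing concrete, here is a minimal sketch. The function name and both dates are illustrative assumptions, not taken from any real system. It treats "tomorrow" as a function of a clock and evaluates it against two different clocks:

```python
from datetime import date, timedelta

# Hypothetical resolver: "tomorrow" is a function of a clock, not a value.
def resolve_tomorrow(anchor: date) -> date:
    return anchor + timedelta(days=1)

authored_on = date(2026, 4, 2)                # assumed date the message was written
read_on = authored_on + timedelta(weeks=6)    # the agent retrieves it six weeks later

intended = resolve_tomorrow(authored_on)  # the binding the author had in mind
invented = resolve_tomorrow(read_on)      # the binding a model gets from its own "now"

print(intended, invented, (invented - intended).days)
```

Same sentence, same function, two answers that sit six weeks apart. Nothing in the text of the message tells you which clock to use.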
The right way to think about this is the way security people already think about untrusted input. A document body retrieved from your corpus is not text-the-agent-can-quote; it is text-that-needs-resolution-against-its-source. Relative dates, deictic pronouns, and relative places are the unresolved positions. If you would not paste a SQL fragment into a query without parameter binding, you should not paste a deictic sentence into a prompt without the metadata that resolves it.
The published work on this is starting to catch up to what production teams keep finding the hard way. Recent benchmarks for time-sensitive retrieval-augmented generation, including ChronoQA's mixture of explicit and implicit time expressions, demonstrate that even state-of-the-art systems collapse on questions whose temporal scope is implicit in the document. A separate line of work on temporally blind agents shows that LLM agents default to a stationary view of context — assuming the world has not moved between messages — and this assumption is the part that breaks first under realistic session lengths.
Normalization belongs at index time, not at inference time
The natural reflex is to fix this in the prompt. Add a line that says "when interpreting relative dates in retrieved documents, resolve them against the document's authorship timestamp." This works on the easy cases and fails on every interesting one. The model has to first detect that a relative date is present, then locate the authorship timestamp in whatever schema you wrapped the chunk in, then reason about the offset, then carry the resolved value through the rest of its reasoning without losing it. Each of those steps fails some fraction of the time, the failures compound, and the failure mode is silent — the model rarely says "I am unsure how to resolve 'tomorrow' here"; it just picks one.
The cleaner intervention is to do the work once, in the retrieval pipeline, before the chunk ever reaches the model. Temporal expression normalization is not a new problem. Rule-based taggers like SUTime and HeidelTime have been doing it for over a decade in the information-extraction world, mapping every relative time expression into the TIMEX3 format anchored against a document creation time. The output of these systems is not a free-form guess but an explicit, rule-derived resolution: "the phrase 'next Monday' in this document, given a creation time of 2026-03-12, resolves to 2026-03-16." You can ship that resolution into the indexed text itself.
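The arithmetic behind that example is small enough to write out. The sketch below is not how SUTime or HeidelTime are implemented; it is a hand-rolled illustration that assumes "next Monday" means the first Monday strictly after the creation date, which is only one of the conventions a real tagger might encode. The helper name `next_weekday` is hypothetical.

```python
from datetime import date, timedelta

def next_weekday(anchor: date, weekday: int) -> date:
    """First date strictly after `anchor` that falls on `weekday` (Monday=0)."""
    days_ahead = (weekday - anchor.weekday() - 1) % 7 + 1
    return anchor + timedelta(days=days_ahead)

creation_time = date(2026, 3, 12)          # a Thursday
resolved = next_weekday(creation_time, 0)  # 0 = Monday

print(resolved.isoformat())  # 2026-03-16

# The same resolution written as a TIMEX3-style annotation.
timex = f'<TIMEX3 tid="t1" type="DATE" value="{resolved.isoformat()}">next Monday</TIMEX3>'
print(timex)
```

The point is that the creation time is an input to the computation. Without it, there is nothing to compute.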
In practice this looks like a small preprocessing step in your ingestion pipeline:
- Extract authorship timestamp for every document (Slack message, ticket comment, transcript turn, email).
- Run a temporal tagger over the body.
- Rewrite each detected relative expression to include its resolved absolute form in parentheses or square brackets: "we'll ship tomorrow [2026-04-03]", "by end of week [week of 2026-04-07]", "call me back in an hour [around 2026-05-22T14:30Z]".
- Index the rewritten text. Keep the original as a separate field for display.
Now the deictic is no longer unbound. The model retrieves a chunk whose content already contains its own resolution. No prompt engineering required, no per-call reasoning to derive what was knowable at ingestion. The variable was bound once, at the moment in the pipeline where the binding environment was still visible.
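The ingestion steps above can be sketched in a few lines. This is a toy under stated assumptions: the regex stands in for a real temporal tagger and only recognizes three expression shapes, and the `ts` and `text` keys, along with the `normalize` and `ingest` names, are hypothetical rather than any platform's actual schema.

```python
import re
from datetime import datetime, timedelta, timezone

# Toy stand-in for a real temporal tagger: it only knows three expression shapes.
DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}
RELATIVE = re.compile(
    r"\b(yesterday|today|tomorrow|in (\d+|an?) (hour|day)s?)\b", re.IGNORECASE
)

def normalize(body: str, authored_at: datetime) -> str:
    """Append the resolved absolute form in brackets after each relative expression."""
    def annotate(match: re.Match) -> str:
        phrase = match.group(1)
        if phrase.lower() in DAY_OFFSETS:
            day = authored_at + timedelta(days=DAY_OFFSETS[phrase.lower()])
            resolved = day.date().isoformat()
        else:
            raw_amount, unit = match.group(2), match.group(3).lower()
            amount = int(raw_amount) if raw_amount.isdigit() else 1  # "a"/"an" means 1
            if unit == "hour":
                moment = authored_at + timedelta(hours=amount)
                resolved = moment.strftime("%Y-%m-%dT%H:%MZ")
            else:
                resolved = (authored_at + timedelta(days=amount)).date().isoformat()
        return f"{phrase} [{resolved}]"
    return RELATIVE.sub(annotate, body)

def ingest(message: dict) -> dict:
    """Build the record to index; `ts` must be an ISO 8601 string with a UTC offset."""
    authored_at = datetime.fromisoformat(message["ts"]).astimezone(timezone.utc)
    return {
        "indexed_text": normalize(message["text"], authored_at),  # what gets embedded
        "display_text": message["text"],                          # original, for display
        "authored_at": authored_at.isoformat(),
    }

record = ingest({"ts": "2026-04-02T16:00:00+00:00", "text": "we'll ship tomorrow"})
print(record["indexed_text"])  # we'll ship tomorrow [2026-04-03]
```

Keeping `display_text` separate means a wrong resolution can be corrected by reindexing without ever having altered what the author actually wrote.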
This approach has a useful side-effect for evals. Once timestamps are inlined into the text, you can build temporal test cases — "does the agent correctly conclude that the ship in this message has already happened?" — by writing assertions against the resolved form rather than against the model's interpretation. The test becomes a property of the retrieval layer, not a property of the model's reasoning, and it stops being a moving target as you upgrade models.
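A test of that kind might look like the sketch below. It assumes the bracketed `[YYYY-MM-DD]` convention described earlier; the helper names and the pinned evaluation date are hypothetical. Note that it never calls a model.

```python
import re
from datetime import date

RESOLVED = re.compile(r"\[(\d{4}-\d{2}-\d{2})\]")

def resolved_dates(indexed_text: str) -> list[date]:
    """Pull every inlined absolute date out of an indexed chunk."""
    return [date.fromisoformat(value) for value in RESOLVED.findall(indexed_text)]

def already_happened(indexed_text: str, as_of: date) -> bool:
    """True only when the chunk carries resolved dates and all of them precede `as_of`."""
    dates = resolved_dates(indexed_text)
    return bool(dates) and max(dates) < as_of

def test_ship_message_is_in_the_past():
    eval_clock = date(2026, 5, 14)  # pinned, so the test does not drift with wall time
    chunk = "we'll ship tomorrow [2026-04-03]"
    assert resolved_dates(chunk) == [date(2026, 4, 3)]
    assert already_happened(chunk, as_of=eval_clock)
    # An unnormalized chunk gives the retrieval layer nothing to assert on.
    assert not already_happened("we'll ship tomorrow", as_of=eval_clock)

test_ship_message_is_in_the_past()
```

Pinning the evaluation clock is the design choice that matters here: a test that reads the real current date will change its verdict as time passes.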
Authorship metadata as a fallback when normalization is wrong
Normalization at index time is the right default, but it has a failure mode of its own. Taggers are not perfect; ambiguous expressions ("Monday" with no week qualifier near a weekend boundary, "next month" said on the last day of a month, idiomatic uses like "back in the day") will sometimes resolve incorrectly, and a confidently wrong resolution baked into the index is worse than an unresolved one. The hedge is to attach authorship metadata to every chunk as structured fields the model can see (authored_at, channel_id, author, thread_id) and to have the prompt present the agent's own now as a separate, distinguished value.
The combination is what works. The inline resolution carries the easy cases for free; the visible authorship timestamp gives the model a path to recover when the resolution is missing, wrong, or contested. The prompt's job is then small and specific: "if you see a date in a retrieved chunk, prefer the explicit absolute form; if only a relative form is present, resolve it against authored_at, not against the current date."
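One way to put both pieces in front of the model is sketched below. The four field names come from the list above; the `Chunk` class, the `render_context` function, the `current_time` label, and the layout of the rendered block are assumptions for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Chunk:
    text: str              # indexed text, with any inline resolutions
    authored_at: datetime
    channel_id: str
    author: str
    thread_id: str

RESOLUTION_RULE = (
    "If you see a date in a retrieved chunk, prefer the explicit absolute form; "
    "if only a relative form is present, resolve it against authored_at, "
    "not against current_time."
)

def render_context(chunks: list[Chunk], now: datetime) -> str:
    """Render retrieved chunks with the agent's clock kept apart from each chunk's clock."""
    lines = [f"current_time: {now.isoformat()}", RESOLUTION_RULE, ""]
    for index, chunk in enumerate(chunks, start=1):
        lines += [
            f"[chunk {index}]",
            f"authored_at: {chunk.authored_at.isoformat()}",
            f"channel_id: {chunk.channel_id}",
            f"author: {chunk.author}",
            f"thread_id: {chunk.thread_id}",
            f"text: {chunk.text}",
            "",
        ]
    return "\n".join(lines)

chunk = Chunk(
    text="we'll ship tomorrow [2026-04-03]",
    authored_at=datetime(2026, 4, 2, 16, 0, tzinfo=timezone.utc),
    channel_id="C-launch",   # made-up identifiers
    author="dana",
    thread_id="T-1",
)
print(render_context([chunk], now=datetime(2026, 5, 14, 9, 0, tzinfo=timezone.utc)))
```

Giving the two clocks different labels is deliberate. If both appeared as a bare "date" field, the model would be back to guessing which one a relative expression refers to.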
This is the same shape as the way frontends handle dates from APIs: the server sends an ISO string, the client renders it in the user's locale, and at no point does anyone trust an unqualified "3/4/26" sitting in a database column to mean what they want it to mean. The discipline is to refuse to let an unqualified date exist in your system at rest.
The retrieval pipeline owns the present tense
The deeper shift is conceptual. Retrieval is not the act of fetching text. Retrieval is the act of preparing text for use by something that was not present when the text was produced. Anything that depended on presence — when, where, who, what was visible to both parties — has to be made explicit or it gets lost. The text alone is incomplete; the system around the text is what makes it complete.
This is why the "shove everything into the vector store and hope for the best" version of RAG keeps producing these classes of bugs even as embeddings get better. Better embeddings find better neighbors. They do not resolve deixis. They do not anchor tomorrow to a date. They do not tell the model that the person who wrote "I'll handle it" is on parental leave now. The retrieval relevance metric and the answer correctness metric are measuring different things, and on a corpus with any meaningful share of time-bound, person-bound, or place-bound content, the gap between them is exactly the place where agents quietly go wrong.
A useful operational frame: every chunk in your index has a speech-act timestamp and a retrieval timestamp. The chunk's truth is a function of both. A status update is true at the speech-act time; whether it remains actionable at the retrieval time depends on the half-life of the claim it makes. A promise like "I'll ship tomorrow" has a half-life of one day. A configuration value like "the rate limit is 100/min" has a half-life of months or years. The retrieval pipeline that does not distinguish these is the one whose agent will eventually call yesterday's user back today.
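If you want to turn that frame into a number, one option is to take "half-life" literally and apply exponential decay. The sketch below does that. The claim classes, the specific half-lives, and the idea of using the score to rank or filter chunks are all assumptions for illustration; how you classify a chunk's claim type is left out entirely.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical claim classes with illustrative half-lives.
HALF_LIFE = {
    "promise": timedelta(days=1),          # "I'll ship tomorrow"
    "status_update": timedelta(days=7),
    "configuration": timedelta(days=365),  # "the rate limit is 100/min"
}

def freshness(claim_type: str, spoken_at: datetime, retrieved_at: datetime) -> float:
    """Exponential decay: 1.0 at the speech act, 0.5 after one half-life."""
    age = max(retrieved_at - spoken_at, timedelta(0))
    return 0.5 ** (age / HALF_LIFE[claim_type])

spoken_at = datetime(2026, 4, 2, tzinfo=timezone.utc)
retrieved_at = spoken_at + timedelta(weeks=6)

for claim_type in HALF_LIFE:
    print(claim_type, freshness(claim_type, spoken_at, retrieved_at))
```

Six weeks on, the promise scores effectively zero while the configuration value is still above 0.9. A ranker or filter could use that gap to keep stale promises out of the agent's plan without dropping durable facts of the same age.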
What to do tomorrow, resolved against today
If you are looking for a place to start, the smallest defensible intervention is a one-week instrumentation exercise:
- Sample a few hundred retrieved chunks from your production traffic. Annotate which ones contain at least one relative time expression. The fraction will surprise you, especially for Slack, transcripts, and ticket-comment corpora — in conversational sources it is routinely 20-40%.
- For each chunk with a relative expression, check whether the resolved meaning is visible to the model. Usually it is not.
- Pick one corpus where this fraction is highest. Add a temporal normalization step to its ingestion. Reindex.
- Build three or four eval cases where the right answer requires resolving a relative date against an old timestamp ("did this ship happen?" "is this offer still open?"). Run them against the old and new pipeline.
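For the first two steps, a rough regex screen can speed up the annotation pass. It does not replace a human or a real tagger: the pattern list below is a small, assumed sample of English expressions and will miss plenty, and the check for "resolved" simply looks for a bracketed ISO date. The `audit` function name and the sample chunks are made up.

```python
import re

# Rough screen for relative time expressions; deliberately incomplete.
RELATIVE_TIME = re.compile(
    r"\b(today|tonight|tomorrow|yesterday|recently|soon"
    r"|(?:next|last|this) (?:week|month|quarter|year|monday|tuesday|wednesday"
    r"|thursday|friday|saturday|sunday)"
    r"|in (?:\d+|an?|a few) (?:minute|hour|day|week|month)s?"
    r"|(?:by )?end of (?:the )?(?:day|week|month|quarter))\b",
    re.IGNORECASE,
)
# An inlined absolute form such as "[2026-04-03]" or "[week of 2026-04-07]".
RESOLVED = re.compile(r"\[[^\]]*\d{4}-\d{2}-\d{2}[^\]]*\]")

def audit(chunks: list[str]) -> dict:
    """Fraction of chunks with a relative expression, and how many of those lack a resolution."""
    with_relative = [c for c in chunks if RELATIVE_TIME.search(c)]
    unresolved = [c for c in with_relative if not RESOLVED.search(c)]
    return {
        "relative_fraction": len(with_relative) / len(chunks) if chunks else 0.0,
        "unresolved_fraction": len(unresolved) / len(with_relative) if with_relative else 0.0,
    }

sample = [
    "we'll ship tomorrow",
    "we'll ship tomorrow [2026-04-03]",
    "the API endpoint is /v2/orders",
    "by end of week, please",
]
print(audit(sample))
```

Run it on the old index and the new one. The `unresolved_fraction` is the number that should move after you add the normalization step.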
The numbers will make the case for rolling the same treatment to the rest of your sources. The lift is not glamorous. It will not show up in your retrieval recall dashboard. It will show up as the absence of the kind of bug where an agent confidently acts on a sentence whose meaning expired before the question was even asked.
Your index is not a transcript. It is a set of claims, each with a clock attached. The clock is what makes the claim mean anything. If you throw the clock away at ingestion, no amount of inference-time prompting will get it back — and the agent reading the result will keep treating last week's tomorrow as the day after this one.
References
- https://arxiv.org/html/2603.16862v1
- https://arxiv.org/abs/2508.12282
- https://arxiv.org/html/2510.23853v2
- https://nlp.stanford.edu/pubs/lrec2012-sutime.pdf
- https://arxiv.org/html/2404.07775v1
- https://www.damiangalarza.com/posts/2026-01-07-llm-date-time-context-production/
- https://ragflow.io/blog/rag-review-2025-from-rag-to-context
- https://www.elastic.co/search-labs/blog/context-poisoning-llm
