The Account Number Your LLM Could Not Actually Copy
A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.
This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It substituted a single character in an existing identifier, and that is a different failure mode, one your eval suite probably never caught because the synthetic identifiers in your test fixtures were unique by construction. Put two account numbers in the same context that share everything but their final character, and the language model, a token predictor that was never specifically trained to copy random strings with fidelity, picked the wrong one.
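One way to close that eval gap is to generate fixtures that contain near-twin identifiers on purpose. The sketch below is a minimal illustration, not a prescribed harness: the `acct_` prefix and five-character body mirror the example above, while `make_id`, `near_twin`, and `build_fixture` are hypothetical helper names.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits


def make_id(rng, prefix="acct_", length=5):
    """Build a random identifier such as acct_7H9j3 (format assumed for illustration)."""
    return prefix + "".join(rng.choice(ALPHABET) for _ in range(length))


def near_twin(identifier, rng, prefix="acct_"):
    """Return a copy of `identifier` with exactly one body character changed."""
    body = list(identifier[len(prefix):])
    pos = rng.randrange(len(body))
    # Exclude the current character so the twin always differs from the original.
    body[pos] = rng.choice([c for c in ALPHABET if c != body[pos]])
    return prefix + "".join(body)


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_fixture(seed=0):
    """Produce a target identifier plus a decoy that is one character away."""
    rng = random.Random(seed)
    target = make_id(rng)
    decoy = near_twin(target, rng)
    return {"target": target, "decoy": decoy}
```

A test case would place both identifiers in the prompt and assert that the tool call carries `target`, never `decoy`.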
The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.
Why identifiers are a string-copy task the model is bad at
Language models are trained to predict the next token from a learned distribution. UUIDs, ARNs, account numbers, transaction IDs, hash digests, and signed URLs are all high-entropy strings whose value is precisely their unpredictability. They are the worst possible inputs for a system whose entire competence is interpolation over patterns it has seen before.
BPE tokenizers shred these strings into many short subword tokens. One published benchmark measured 24 tokens per UUID against roughly 1.25 tokens per English word. The model is asked to reproduce a 24-token sequence verbatim, often after dozens of other tokens have intervened in the context, with no semantic anchor to verify against. When two similar identifiers appear in the same context, the conditional probability of the correct continuation collapses toward a coin flip on each ambiguous token.
The same benchmark ran a 200-item aggregation task that required referencing 100 distinct identifiers. With raw UUIDs the model produced roughly 48 errors per run. Remapped to small integer aliases, the error count fell to about 6. The architectural fact this exposes is sharp: identifier fidelity is a property of the input format, not the model. You can keep the model and cut the error count roughly eightfold, close to an order of magnitude, just by changing what the identifiers look like.
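The remapping idea can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's own code: the `AliasMap` class and the `<id:N>` alias format are hypothetical choices, and the regex only covers the canonical hyphenated UUID layout.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
# Hypothetical alias format shown to the model, e.g. <id:3>.
ALIAS_RE = re.compile(r"<id:(\d+)>")


class AliasMap:
    """Swap UUIDs for small integer aliases before prompting, and back afterwards."""

    def __init__(self):
        self._by_uuid = {}   # normalized uuid -> alias number
        self._by_alias = []  # alias number - 1 -> normalized uuid

    def encode(self, text):
        """Replace every UUID in `text` with a stable alias."""
        def swap(match):
            raw = match.group(0).lower()  # UUIDs are normalized to lowercase
            if raw not in self._by_uuid:
                self._by_alias.append(raw)
                self._by_uuid[raw] = len(self._by_alias)
            return f"<id:{self._by_uuid[raw]}>"
        return UUID_RE.sub(swap, text)

    def decode(self, text):
        """Restore UUIDs in model output; an alias that was never issued raises KeyError."""
        def restore(match):
            index = int(match.group(1))
            if not 1 <= index <= len(self._by_alias):
                raise KeyError(f"unknown alias {match.group(0)}")
            return self._by_alias[index - 1]
        return ALIAS_RE.sub(restore, text)
```

The harness would call `encode` on everything entering the context and `decode` on everything leaving it, so the model only ever sees and emits the short aliases.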
The failure mode is also asymmetric in a way teams underestimate. The model is not equally likely to corrupt every character of an identifier. It is much more likely to corrupt the characters with the highest local entropy — the random-looking middle, where its prior is weakest. So the corrupted identifier often still parses, still passes a regex, still routes to a real record, still completes the tool call. You do not get a loud not found. You get a quiet wrong-customer.
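A small sketch makes the gap between "well-formed" and "correct" concrete. It assumes the `acct_` format from the opening example; `check_emitted_id` and its `expected` parameter are hypothetical, standing in for whatever binding the harness has to the record the request is actually about.

```python
import re

# Assumed identifier format, taken from the acct_7H9j3 example.
ACCOUNT_RE = re.compile(r"acct_[A-Za-z0-9]{5}")


def well_formed(identifier):
    """True if the identifier matches the expected shape. Says nothing about correctness."""
    return ACCOUNT_RE.fullmatch(identifier) is not None


def check_emitted_id(emitted, seen_in_context, expected=None):
    """Classify an identifier the model placed in a tool call.

    seen_in_context: identifiers that tools actually returned in this session.
    expected: the record the harness believes the request concerns, if it knows.
    """
    if not well_formed(emitted):
        return "malformed"
    if emitted not in seen_in_context:
        return "unseen"        # corrupted into a string no tool ever returned
    if expected is not None and emitted != expected:
        return "wrong_record"  # a real identifier, but not the one in question
    return "ok"
```

Note what each layer catches. The regex accepts both twins, and the membership check accepts both as long as both appeared in context. Only the comparison against an independently known `expected` value flags the quiet substitution.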
The failure modes you see in production
The transposed-customer story is the loudest version. The quieter ones look like this.
The agent compares two orders by their IDs rather than their fields. The user asked which order was larger. The model answered confidently from the prefix patterns of the IDs themselves, treating the strings as informative tokens rather than opaque keys. The answer was wrong. The trace shows two get_order calls that returned correct data; the model's reasoning chose not to use it.
The agent summarizes a document by referencing its filename. A retrieval tool returned a list of Document records each with an id, a title, and a body. The model's summary cites the title for half the documents and the id for the other half, because the prompt did not name which field carried the human-meaningful content and the model guessed inconsistently.
The agent answers a question about an image by reasoning about the URL string. The tool returned a presigned S3 link. The model never called the follow-on describe_image tool because nothing in the schema told it the URL was a reference rather than a value. It produced a plausible answer derived from the path components and the bucket name.
The agent issues a refund against a transaction ID that was never in the request. A retrieval call returned five transaction IDs in the same response. The model picked one, but in copying it across a tool-call boundary, dropped two characters in the middle. The new ID happened to belong to a different transaction in the same account, and the tool happily refunded it.
In each case the trace looks healthy. The tool call succeeded. The return was well-formed. The agent's reasoning was articulate. The harm came from a string-copy step the architecture never named as a step.
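Several of these cases share a cause: nothing in the tool result told the model which fields were content, which were opaque keys, and which were references to something it had not yet read. One hedged way to make that explicit is to tag each field with a role before rendering it into the prompt. The `Field` type, the role names, and the rendered wording below are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Literal

Role = Literal["content", "key", "reference"]


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    role: Role
    follow_up_tool: str = ""  # only meaningful when role == "reference"


def render_for_model(fields: List[Field]) -> str:
    """Render a tool result so each field's role is stated next to its value."""
    lines = []
    for f in fields:
        if f.role == "content":
            lines.append(f"{f.name}: {f.value}")
        elif f.role == "key":
            lines.append(f"{f.name} (opaque key, carries no meaning): {f.value}")
        else:
            lines.append(
                f"{f.name} (reference, not content; call {f.follow_up_tool} to read it): {f.value}"
            )
    return "\n".join(lines)
```

For example, an image record might be rendered with `Field("url", link, "reference", "describe_image")`, so the prompt itself says the URL must be dereferenced rather than reasoned about.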
Stop asking the model to copy IDs
The fix is to remove identifier fidelity from the model's job description. The model should never type a raw identifier into a tool call. The harness should resolve identifiers from a stable, typed reference table that the model addresses by symbolic handle.
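A minimal sketch of such a reference table follows. It is one possible shape, not a definitive implementation: `RefTable`, `Ref`, the `account_1`-style handle format, and the `issue_refund` wrapper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Ref:
    kind: str    # e.g. "account" or "transaction"
    raw_id: str  # the real identifier; never shown to the model


class UnknownHandle(Exception):
    """The model used a handle the harness never issued."""


class WrongKind(Exception):
    """The handle exists but points at a different kind of record."""


class RefTable:
    def __init__(self):
        self._refs: Dict[str, Ref] = {}
        self._counters: Dict[str, int] = {}

    def register(self, kind: str, raw_id: str) -> str:
        """Return a short handle for a raw identifier, reusing any existing one."""
        wanted = Ref(kind, raw_id)
        for handle, ref in self._refs.items():
            if ref == wanted:
                return handle
        self._counters[kind] = self._counters.get(kind, 0) + 1
        handle = f"{kind}_{self._counters[kind]}"
        self._refs[handle] = wanted
        return handle

    def resolve(self, handle: str, expected_kind: str) -> str:
        """Map a handle back to its raw identifier, failing loudly on any mismatch."""
        ref = self._refs.get(handle)
        if ref is None:
            raise UnknownHandle(handle)
        if ref.kind != expected_kind:
            raise WrongKind(f"{handle} is a {ref.kind}, expected {expected_kind}")
        return ref.raw_id


def issue_refund(table: RefTable, args: dict) -> dict:
    """Hypothetical tool wrapper: the model supplies a handle, the harness supplies the ID."""
    account_id = table.resolve(args["account"], "account")
    return {"tool": "refund", "account_id": account_id, "amount_cents": args["amount_cents"]}
```

With this shape, a raw identifier typed by the model is rejected as an unknown handle instead of being executed. The model can still choose the wrong handle, so this removes copy errors rather than selection errors, but a mis-copied handle now fails loudly instead of routing to a neighboring record.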
- https://boundaryml.com/blog/uuid-swap
- https://nikhil-verma.com/blog/llms-unreliable-narrators-uuid-hallucination/
- https://dev.to/nikhilverma/llms-as-unreliable-narrators-dealing-with-uuid-hallucination-151e
- https://www.giskard.ai/knowledge/function-calling-in-llms-testing-agent-tool-usage-for-ai-security
- https://huggingface.co/learn/llm-course/en/chapter6/5
