Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix
Here's a failure mode that happens silently: you ask a coding agent to update the authentication service's token refresh endpoint. The agent produces clean-looking code — confident, well-commented, type-safe. It also calls a method signature that was renamed three months ago in a shared library three directories up. The tests for that endpoint pass because the mock still uses the old signature. The bug surfaces in staging when the real library gets pulled in.
This isn't a hallucination in the abstract sense. The model knew about that method — it existed somewhere in the training data or was briefly visible in context. The problem is architectural: the agent never had access to the current version of the interface it was calling.
This is the central failure mode of coding agents in large monorepos. It's not that the models are bad. It's that they're operating without a map in a building they've only partially seen.
The Scale Mismatch Is Worse Than It Looks
Current LLMs have context windows in the range of 100,000 to 2 million tokens. That sounds large until you measure what an enterprise monorepo actually contains. A typical 50-service repo with shared libraries, infrastructure-as-code, generated types, and test suites can span tens of millions of tokens of source code.
Even if you cherry-pick only the "relevant" files, you run into a subtler problem: relevance is precisely what you don't know in advance. When a change touches a shared utility, the relevant files are everything that imports it — which you can only discover by traversing the dependency graph you haven't loaded yet.
Bigger context windows improve the situation but don't eliminate it. Research on long-context utilization (the "lost in the middle" effect) found that models fail to attend uniformly across very long inputs: content near the middle of the prompt is recalled less reliably than content near the edges. So even when you can fit 500 files into the prompt, retrieval quality varies with position. You get unpredictable degradation, not clean failure.
And then there's cost. At current token pricing, indiscriminately including a large fraction of a monorepo in every request is financially unsustainable at any meaningful engineering scale.
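The arithmetic is easy to run yourself. A back-of-envelope sketch, where every number (the token price, request volume, team size) is an illustrative assumption rather than a quoted figure:

```python
# Back-of-envelope cost of naive context stuffing.
# All inputs are illustrative assumptions, not real prices or usage data.

def monthly_context_cost(
    repo_tokens: int,        # tokens of repo content sent with each request
    price_per_mtok: float,   # assumed input price, dollars per million tokens
    requests_per_day: int,   # agent requests per engineer per day
    engineers: int,
    workdays: int = 22,
) -> float:
    """Dollar cost of resending the same context on every agent request."""
    per_request = repo_tokens / 1_000_000 * price_per_mtok
    return per_request * requests_per_day * engineers * workdays

# Say an agent stuffs 500k tokens of monorepo context into each request at an
# assumed $3 per million input tokens, 40 requests/day, 50 engineers:
cost = monthly_context_cost(500_000, 3.0, 40, 50)
print(f"${cost:,.0f}/month")  # → $66,000/month
```

The exact numbers don't matter; the shape does. Cost scales linearly with how much context you resend per request, which is why scoped retrieval pays for itself quickly.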
The practical ceiling for naive context stuffing — dumping relevant-looking files into the prompt — is somewhere around 2,500 files before tool indexing quality visibly degrades. Most serious monorepos crossed that line years ago.
What Agents Are Actually Doing When They Hallucinate Interfaces
When a coding agent calls a function with the wrong signature, what happened mechanically?
The agent needed to call a cross-package interface. It either saw that interface earlier in the session (and the session rolled over), inferred it from similar patterns in other parts of the codebase it did see, or interpolated from training data for a similar library. None of these are reliable — but they all produce syntactically valid output that will fool a cursory review.
The fix isn't to add more error handling or post-processing. The fix is ensuring the agent has a reliable, current view of the interface before it writes the call. That requires active context construction, not passive context stuffing.
Repo-Map Indexing: The First Layer of the Solution
The most mature approach to this problem is the repo map — a compact structural index of the entire codebase that fits within a reasonable token budget.
Aider's implementation is the canonical example. Rather than including file contents, the repo map lists files alongside their key symbols: class names, function signatures, exported types. The map itself occupies around 1,000 tokens by default, but it gives the model a topology of the entire repo. When the agent needs to understand a specific interface, it knows which file to request — and it can ask for that file's actual contents rather than inferring from memory.
The selection algorithm underneath this is a personalized PageRank over a dependency graph. Each file is a node; import statements and symbol references are edges. Files that are heavily referenced score higher and get more representation in the map. This naturally surfaces the shared utilities and core abstractions that are most likely to matter for any given change — the packages that, if misunderstood, cause the most downstream breakage.
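The mechanics can be sketched in a few lines. This is a minimal personalized PageRank in pure Python to illustrate the selection idea; aider's real implementation differs in details, and the file names below are hypothetical. Edges point from importer to imported file, and files the user mentioned seed the personalization vector:

```python
# Minimal personalized PageRank over a file dependency graph, as a sketch of
# the selection idea behind repo maps. Edges point from importer to imported.

def pagerank(edges, personalization, damping=0.85, iters=50):
    nodes = set(personalization)
    for a, b in edges:
        nodes.update((a, b))
    out = {n: [] for n in nodes}
    for a, b in edges:
        out[a].append(b)
    total = sum(personalization.values())
    base = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(base)
    for _ in range(iters):
        nxt = {n: (1 - damping) * base[n] for n in nodes}
        for n, r in rank.items():
            if out[n]:
                share = damping * r / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:  # dangling node: redistribute via the personalization vector
                for m in nodes:
                    nxt[m] += damping * r * base[m]
        rank = nxt
    return rank

# Hypothetical repo: two services import a shared token library. Seeding the
# services the task mentions makes the shared utility score highest, so it
# earns space in the map.
edges = [
    ("services/auth/api.py", "shared/tokens.py"),
    ("services/auth/api.py", "shared/db.py"),
    ("services/billing/api.py", "shared/tokens.py"),
]
seed = {"services/auth/api.py": 1.0, "services/billing/api.py": 1.0}
scores = pagerank(edges, seed)
print(max(scores, key=scores.get))  # → shared/tokens.py
```

The personalization vector is what makes this task-aware: the same graph produces a different map depending on which files the current change touches.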
Tree-sitter parses source files into syntax trees, so the symbol extraction is language-aware rather than regex-based. A Go interface, a TypeScript type export, and a Python class definition all get parsed correctly and represented uniformly in the map.
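As a single-language analogue of what that extraction produces, Python's stdlib `ast` module can render the same kind of map entry for Python files (tree-sitter's value is doing this uniformly across languages; this sketch and its file names are illustrative only):

```python
# A single-language analogue of repo-map symbol extraction. Tree-sitter does
# this across languages; Python's stdlib ast module is enough to sketch the
# idea for Python source: list symbols and signatures, not file contents.
import ast

def map_entry(path: str, source: str) -> str:
    """Render one repo-map entry: the file plus its top-level symbols."""
    tree = ast.parse(source)
    lines = [f"{path}:"]
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args})")
    return "\n".join(lines)

src = '''
class TokenStore:
    def refresh(self, token, ttl): ...

def issue_token(user_id): ...
'''
print(map_entry("shared/tokens.py", src))
```

The output is a handful of signature lines per file, which is why an entire repo's topology fits in a four-figure token budget while full contents would not.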
This approach doesn't require the agent to know what's relevant before it starts. It front-loads structural awareness and lets the agent navigate from the overview to the specifics.
Dependency Graph Traversal for Cross-Service Changes
Repo maps capture signatures but not runtime dependency relationships — what calls what across service boundaries, what the blast radius of a change actually is.
This is where dependency graph tools add a distinct layer. Tools like Ctxo and the Nx project graph give agents query access to the full dependency topology. Instead of the agent guessing which services depend on the auth library it's modifying, it can issue a query: which packages import this module? The answer comes back as a list, not an inference.
The progressive exploration pattern this enables looks like:
- Start at the domain level — which bounded context does this change belong to?
- Use the project graph to identify affected packages and their consumers.
- Pull in the actual file contents for those specific packages.
- Make the change with full awareness of the interfaces on both sides.
The key difference from naive retrieval is that the agent never has to hold all of this in context simultaneously. It traverses the graph in steps, building up a coherent picture rather than trying to ingest everything at once.
For understanding blast radius specifically — what breaks if you change this function — graph traversal is more reliable than vector similarity. Semantic similarity doesn't know that two services are coupled through a shared Protobuf schema. The dependency graph does.
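The blast-radius query itself is just a reverse-dependency traversal. A minimal sketch, with an invented package graph (real tools like the Nx project graph expose equivalent queries over graphs they build for you):

```python
# Blast radius via reverse-dependency traversal: everything that transitively
# imports a module is in scope when that module changes. The package names
# and graph are illustrative.
from collections import deque

def blast_radius(imports, changed):
    """imports maps each package to the packages it imports directly."""
    # Invert the edges: for each package, who imports it.
    importers = {}
    for pkg, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(pkg)
    # Breadth-first search upward from the changed package.
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for pkg in importers.get(node, ()):
            if pkg not in seen:
                seen.add(pkg)
                queue.append(pkg)
    return seen

graph = {
    "services/auth": ["libs/tokens", "libs/db"],
    "services/billing": ["libs/tokens"],
    "services/gateway": ["services/auth"],
    "libs/tokens": ["libs/db"],
}
print(sorted(blast_radius(graph, "libs/tokens")))
# → ['services/auth', 'services/billing', 'services/gateway']
```

Note that `services/gateway` appears even though it never imports `libs/tokens` directly; that transitive hop is exactly what vector similarity misses and graph traversal catches.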
Scoped Context Construction: Domain Before Files
Both repo maps and dependency graphs feed into a broader principle: context construction should be scoped, not exhaustive.
The instinct when an agent fails is to add more context. That's usually wrong. More context means more noise, more tokens spent on irrelevant files, and a higher chance the model loses the signal in the middle of a long prompt. The right move is sharper scoping.
A practical architecture for this works in three levels:
Level 1: Architectural steering. A root CLAUDE.md or AGENTS.md file describes the repo's high-level structure — where the services live, what naming conventions mean, which directories are generated vs. hand-authored. This costs a few hundred tokens and prevents the agent from wasting exploration steps on basic navigation.
Level 2: Domain-scoped documentation. Each major service or domain owns a file that describes its conventions, its interface contracts, and its dependencies. The agent reads only the relevant one — not all of them.
Level 3: On-demand file retrieval. The repo map tells the agent which files contain the symbols it needs. The dependency graph tells it which packages are in scope. Only then does it pull actual file contents.
This layered approach keeps the working context small and high-signal for any given task, while preserving the agent's ability to navigate the full codebase when it needs to.
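The three levels can be wired together in a few lines. A sketch of layered context assembly, where the file names, layout, and character budget are all assumptions; the point is the order of operations, not the specifics:

```python
# Layered context assembly: root steering doc, then the one relevant domain
# doc, then only the files the repo map and dependency graph put in scope.
# File names and the budget are illustrative assumptions.
from pathlib import Path

def build_context(repo: Path, domain: str, files: list[str],
                  budget_chars: int = 60_000) -> str:
    parts = []
    steering = repo / "AGENTS.md"             # Level 1: architectural steering
    if steering.exists():
        parts.append(steering.read_text())
    domain_doc = repo / domain / "AGENTS.md"  # Level 2: one domain's doc, not all
    if domain_doc.exists():
        parts.append(domain_doc.read_text())
    for rel in files:                         # Level 3: on-demand file contents
        chunk = f"### {rel}\n{(repo / rel).read_text()}"
        if sum(len(p) for p in parts) + len(chunk) > budget_chars:
            break                             # stay under budget rather than overflow
        parts.append(chunk)
    return "\n\n".join(parts)
```

The important design choice is that `files` arrives already scoped, from the repo map and graph queries, so the budget check is a safety rail rather than the primary filter.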
The Two-Stage Retrieval Pattern for Interface Accuracy
For the specific problem of interface hallucination, two-stage retrieval is the most reliable mitigation.
In the first stage, you retrieve candidates — files, functions, and types that are semantically related to the task. This is where vector search does its job: finding things that look relevant based on embedding similarity.
In the second stage, an LLM filters and ranks the candidates with awareness of the actual task. This step catches the cases where semantic similarity led you to an outdated version of a type, or to a function with the same name in a different package. The filtering model can ask: does this symbol have the right interface for the calling context? Is this the current canonical definition, or a legacy version?
The two-stage pattern doesn't guarantee the agent sees every relevant file. But it substantially reduces the rate at which it gets confidently wrong information. The first stage casts wide; the second stage cuts noise.
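The pattern reduces to two small functions. In this sketch, a trivial word-overlap score stands in for embedding search in stage one, and stage two's LLM call is stubbed out as an injected predicate; the symbols and snippets are invented for illustration:

```python
# Two-stage retrieval sketch. Stage one stands in for embedding search with a
# bag-of-words overlap score; stage two is where a real system would put an
# LLM reranking call, stubbed here as an injected filter function.
from collections import Counter

def stage_one(task: str, candidates: dict[str, str], k: int = 5) -> list[str]:
    """Cast wide: score every candidate snippet by word overlap with the task."""
    task_words = Counter(task.lower().split())
    def score(text: str) -> int:
        return sum((Counter(text.lower().split()) & task_words).values())
    return sorted(candidates, key=lambda n: score(candidates[n]), reverse=True)[:k]

def stage_two(task, candidates, shortlist, llm_filter):
    """Cut noise: the filter (an LLM in a real system) checks each survivor
    against the actual task: right interface? current canonical version?"""
    return [n for n in shortlist if llm_filter(task, n, candidates[n])]

candidates = {
    "shared/tokens.py::refresh": "current token refresh helper takes token and ttl",
    "legacy/tokens.py::refresh": "deprecated legacy token refresh takes token only",
    "services/billing/invoice.py::total": "computes an invoice total",
}
task = "update the token refresh endpoint"
shortlist = stage_one(task, candidates, k=2)
# A stand-in filter that rejects anything marked deprecated; a real deployment
# would prompt a model with the task and the snippet instead.
kept = stage_two(task, candidates, shortlist, lambda t, n, s: "deprecated" not in s)
print(kept)  # → ['shared/tokens.py::refresh']
```

Stage one happily surfaces both the current and the legacy `refresh`; stage two is what separates them, which is exactly the failure the naive single-stage pipeline can't catch.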
What This Means for How You Structure a Monorepo for AI
If you're running a monorepo and expecting coding agents to operate across it, the structural decisions you make at the repo level have direct consequences for agent reliability.
Explicit interface contracts matter more than they used to. When interfaces are expressed as versioned types with clear export paths, the dependency graph can find them and the repo map can represent them. Implicit interfaces that live in undocumented conventions are invisible to both.
Domain tagging helps agents navigate before they start reading. A tagging system — even just directory naming conventions that group services by domain — gives the agent a way to start from a meaningful abstraction level rather than from the file system root.
Steering documents at each service boundary need to be maintained. These are the per-domain AGENTS.md files that describe what the service does, what it depends on, and how its interfaces are organized. They're low-cost to write and high-value for agents that would otherwise have to infer all of this from first principles.
The agent failure mode we started with — calling a renamed method — is preventable. Not by making the model smarter, but by ensuring the current interface definition is findable, indexed, and surfaced before the model starts generating code. That's a tooling and structure problem, and it has tooling and structure solutions.
The teams getting reliable cross-service agent work today aren't using bigger context windows. They're building the infrastructure that lets agents navigate precisely.
