
Skills as Modules: When Your Agent Stack Needs an Import System

10 min read
Tian Pan
Software Engineer

A team I talked to last month hit a bug that any seasoned package-manager user would recognize on sight. Two skills in their agent shipped the same search_orders capability — one came from a billing toolpack, one came from a CRM toolpack. Whichever had been added to the manifest most recently won. The agent silently called the wrong one for three weeks. Refunds went to the wrong customer IDs. Their fix, they told me, was a meeting with the CRM and billing engineers to "agree on naming." A meeting. To resolve a name conflict between two installable modules.

That's the moment I realized what's happening in agent runtimes right now. The runtime-loadable capability pattern — skills, tool packs, prompt fragments, retrieval providers, MCP servers — is converging on the same problem languages solved with import systems decades ago. Name resolution. Version pinning. Dependency graphs. Conflict detection. Lazy loading. And most agent runtimes are reinventing each one badly, or not at all, and shipping the bill to their users in the form of meetings.

The pattern itself is sound. Anthropic's Skills, Microsoft's Agent Framework skills, GitHub's gh skill system, the long tail of MCP servers — they all converge on the same idea: a unit of capability with metadata, instructions, and optional executable resources, loaded by the agent at runtime when relevant. Composio counts over a thousand published skills across registries by mid-2026. Awesome-agent-skills repos cross 1,000 entries. APM, the Microsoft-led Agent Package Manager, treats agent configuration the way package.json treats npm dependencies.

That's the surface. Underneath, the architecture is one or two language-design generations behind. And the gap is starting to bite.

The four problems package managers already solved

Walk through what an import system actually does, and the missing pieces in agent runtimes line up neatly.

Name resolution. Python decides which requests you get based on sys.path order, virtualenvs, and the package's declared name. Node decides based on node_modules traversal and package fields. There is a system: sometimes wrong, sometimes confusing, but specified. In most agent runtimes today, when two installed skills define a tool with the same name, behavior is "last write wins" or undefined. The MCP working group's own discussion thread on duplicate tool names across servers reads like a Python-2-vs-3 argument from 2008: prefix? namespace? alias? scope? No standard answer yet, and every CLI implements its own ad-hoc rule. Gemini CLI auto-prefixes with the server name. Some Claude skills don't prefix at all. The agent doesn't know which delete_record it's calling.
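To make the failure mode concrete, here is a minimal sketch of a registry that refuses ambiguity the way a linker would, rather than letting the last write win. All names here (ToolRegistry, ToolConflictError) are invented for illustration; no runtime exposes this exact API:

```python
# Illustrative sketch: a registry that namespaces tools by server and fails
# loudly on duplicate or ambiguous names instead of "last write wins".
class ToolConflictError(Exception):
    pass

class ToolRegistry:
    def __init__(self):
        self._tools = {}  # qualified name ("server.tool") -> callable

    def register(self, server, tool, fn):
        qualified = f"{server}.{tool}"  # prefix by server, as Gemini CLI does
        if qualified in self._tools:
            raise ToolConflictError(f"duplicate tool: {qualified}")
        self._tools[qualified] = fn

    def resolve(self, name):
        if name in self._tools:  # fully-qualified lookup always works
            return self._tools[name]
        # A bare name resolves only when exactly one server provides it.
        matches = [q for q in self._tools if q.split(".", 1)[1] == name]
        if len(matches) != 1:
            raise ToolConflictError(f"ambiguous or unknown tool {name!r}: {matches}")
        return self._tools[matches[0]]
```

With billing and CRM both registering search_orders, a bare resolve("search_orders") raises instead of silently picking one, which is exactly the meeting that opening anecdote would have avoided.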

Version pinning. package-lock.json, requirements.txt with hashes, Cargo.lock — these exist because "this code worked yesterday" should mean something. Agent skills, until very recently, were typically pulled from a GitHub raw URL or a registry's "latest" tag. APM ships an apm.lock.yaml. SkillFortify and skills-lock exist because the gap was painful enough that multiple groups wrote competing tools to fill it. GitHub's gh skill --pin arrived in April 2026. None of these are universal yet. A team installs a skill on Monday, the upstream maintainer changes the system prompt on Wednesday to "fix" something, and on Thursday a regression appears that took three days of bisecting evals to find. This is the dependency-pinning gap, and it makes "this agent worked yesterday" non-reproducible in the most literal sense.
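The pinning mechanics themselves are not complicated, which is what makes the gap frustrating. A minimal sketch of content-hash verification against a lock entry; the lock shape ("version", "integrity") is hypothetical, loosely modeled on package-lock.json rather than taken from any of the tools above:

```python
# Illustrative sketch: verify a skill body against a lockfile content hash
# before the agent runs, so a quiet upstream edit fails at load time.
import hashlib

def skill_digest(body: str) -> str:
    return "sha256:" + hashlib.sha256(body.encode("utf-8")).hexdigest()

def verify_pinned(name: str, body: str, lock: dict) -> None:
    pinned = lock[name]["integrity"]
    actual = skill_digest(body)
    if actual != pinned:
        # Upstream changed the skill since install: refuse loudly now,
        # instead of surfacing the drift as a Thursday regression.
        raise RuntimeError(f"{name}: integrity mismatch (want {pinned}, got {actual})")
```

The Wednesday "fix" in the story above becomes a load-time error on Wednesday, not a three-day bisect on Thursday.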

Dependency graphs. Skills increasingly depend on other skills. A code-review skill might compose read-pr, lint, and summarize-diff. The TDCommons publication on Skill Instruction File Dependency Resolution proposes semver-style range constraints declared inside each SKILL.md, with a sidecar lockfile recording the resolved versions. That's roughly what npm shipped in 2010. We are 16 years late and still arguing about whether to ship it at all. Without a real dependency graph, "install this skill" cannot reliably also install what that skill needs. So skills compensate by inlining their dependencies, duplicating prompt fragments, and shipping fat single-file bundles, recreating exactly the situation library authors invented module systems to escape.
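The resolution step is decades-old machinery. A sketch of caret-range resolution over a skill index, in the spirit of that proposal; the range grammar and index shape are assumptions for illustration, not the published spec:

```python
# Illustrative sketch: resolve semver-style "^x.y.z" skill dependencies to
# the highest satisfying version, producing the locked set a sidecar
# lockfile would record.
def parse(v):
    # "1.4.2" -> (1, 4, 2), so tuples compare like versions
    return tuple(int(x) for x in v.split("."))

def satisfies(version, range_spec):
    # Caret range "^MAJOR.MINOR.PATCH": same major, at or above the base.
    base = parse(range_spec.lstrip("^"))
    v = parse(version)
    return v[0] == base[0] and v >= base

def resolve(requirements, available):
    """requirements: {skill: "^x.y.z"}; available: {skill: [versions]}."""
    locked = {}
    for skill, spec in requirements.items():
        candidates = [v for v in available[skill] if satisfies(v, spec)]
        if not candidates:
            raise LookupError(f"no version of {skill} satisfies {spec}")
        locked[skill] = max(candidates, key=parse)
    return locked
```

With this in place, "install code-review" can mechanically pull in read-pr, lint, and summarize-diff at compatible versions instead of inlining them.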

Conflict detection. Even if two skills don't share a tool name, they often share behavior — both want to format the agent's response, both attach a "before-tool-call" hook, both inject a system prompt fragment. Microsoft's Agent Skills Standard explicitly calls out the requirement to "encode decision order and conflict resolution rules" and warns against duplicating overlapping skills with slightly different wording. But as a runtime concern, that warning is shipped to the user as documentation, not enforced as a constraint. Compare this to how a linker treats duplicate symbol definitions — fail loudly at link time. Agent runtimes ship duplicate-behavior conflicts to production and surface them as flaky outputs.
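What "fail loudly at link time" would look like for skills is an install-time pass over manifests that aborts on overlapping exclusive extension points. A sketch; the manifest fields ("name", "extends") and the list of exclusive points are invented for illustration:

```python
# Illustrative sketch: linker-style duplicate-behavior detection at install
# time. Two skills claiming the same exclusive extension point abort the
# install instead of shipping a flaky runtime.
EXCLUSIVE_POINTS = {"response_formatter", "system_prompt_prefix"}

def check_conflicts(manifests):
    owners = {}   # extension point -> first skill that claimed it
    errors = []
    for m in manifests:
        for point in m.get("extends", []):
            if point in EXCLUSIVE_POINTS and point in owners:
                errors.append(f"{m['name']} and {owners[point]} both provide {point}")
            owners.setdefault(point, m["name"])
    if errors:
        raise SystemExit("install aborted:\n  " + "\n  ".join(errors))
```

The user then resolves the conflict once, at install time, with full context, instead of debugging flaky outputs in production.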

The lazy-loading problem nobody talks about

Lazy loading deserves its own section because it's the place where the analogy to traditional import systems breaks in an interesting way.

Python imports are eager: by the time import foo returns, foo's module body has executed. Agent skills can't be eager — there are too many of them, and stuffing every available skill's instructions into the system prompt would burn the entire context window before the user typed anything. So the dominant pattern is lazy: keep a lightweight catalog (name, description, when-to-use line), and load the full skill body only when the model decides it's relevant.

This is closer to dynamic plugin loading than to static imports. And it means the runtime is making content-addressable loading decisions on behalf of the model. Hermes Agent's GitHub issue tracker has multiple bugs filed about lazy-loaded skills that fail to register because the catalog scanner only matches files literally named SKILL.md with a specific YAML frontmatter shape. The bug class is exactly the one Java's classpath scanner had in 2003: an indexer with implicit format requirements that silently drops anything not matching, with no diagnostic.

The deeper issue: when load decisions are content-addressable and made by the model, the model can be wrong about which skill to load. There is no compile-time check. There is no "did you mean read_pr_v2?" The model picks a skill that almost matches, executes it, and the failure shows up as a wrong answer, not a missing-import error. Module systems work in part because they fail loudly when something is missing or ambiguous. Lazy skill loaders fail quietly, by definition.

A loader worth using treats this as a first-class concern: it tracks which skill the model selected, why (the description tokens that matched), and what it did with that skill. Without those traces, debugging looks like reading entrails.
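A minimal version of that trace, assuming naive token overlap stands in for the loader's real matching; the field names are invented for illustration:

```python
# Illustrative sketch: record which skill the model selected and which
# description tokens overlapped with the query (the "why"), plus near-miss
# skills for "did you mean" triage.
def trace_selection(query, catalog, chosen):
    q = set(query.lower().split())
    entry = catalog[chosen]
    return {
        "chosen": chosen,
        "matched": sorted(q & set(entry["description"].lower().split())),
        "near_misses": sorted(
            name for name, e in catalog.items()
            if name != chosen and q & set(e["description"].lower().split())
        ),
    }
```

When the model picks a skill that almost matches, the near_misses field turns a wrong answer into a diagnosable one.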

The supply-chain problem is going to get worse before it gets better

Every package manager eventually had a supply-chain incident. PyPI typosquats. npm event-stream. Crates.io maintainer abandonment. The agent skill ecosystem is sitting at the same stage npm sat at around 2014: thousands of skills, hundreds of contributors, no real signing story, no provenance audit, lockfiles that record SHAs but not signatures.

A "tool shadowing" attack — where a malicious MCP server registers tools with the same name as a legitimate one, exploiting the lack of namespace enforcement — is documented in the OpenAI Agent Builder MCP analysis from RunLayer. It's the agent equivalent of DLL hijacking. The defense isn't smarter prompts; it's namespace isolation enforced by the runtime, the same way Linux dynamic loaders enforce versioned symbol scopes.

Pinning helps. SkillFortify-style content hashes in lockfiles help. Provenance attestation (Sigstore-style) for skill bundles will eventually be mandatory for anything running with elevated tool credentials. None of this is technically novel — it's the same playbook every other ecosystem ran. Agents are just running it at the speed of 2026, with the corresponding compression of "first incident" timelines.

What a real module layer looks like for agents

If I were proposing the architectural target rather than describing the current chaos, it would have these properties — and notice how many are non-negotiable in any serious language ecosystem:

  • Stable identity. Every skill has a fully-qualified name (org/pack/skill@version) that is what the runtime resolves. Display names are display names. Resolution uses the qualified name.
  • Declared dependencies. Each skill names what it requires (other skills, MCP servers, models, tool capabilities). Resolution either succeeds with a satisfying set or fails before the agent runs.
  • Lockfiles by default. The resolved set is recorded with content hashes. agent install is reproducible. agent update is a separate, deliberate verb.
  • Conflict policy at install time, not runtime. If two skills register the same tool, that's a conflict. The user picks: alias one, exclude one, prefix both. The runtime never silently picks for you. This is a flat refusal to ship the bug class my opening anecdote came from.
  • Lazy loading with a real catalog format. A separate, machine-validated index file — not a heuristic scan of SKILL.md files. Adding a skill updates the index; the loader never guesses.
  • Capability-scoped credentials. The skill manifest declares which credentials it needs to call which tools. The runtime grants those scopes and only those. Borrow the OAuth scope vocabulary; the problem is the same.
  • Provenance and signing. Lockfile entries carry signatures. Unsigned skills run in a narrowed sandbox. Adding an unsigned skill to a production agent requires explicit override.
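The first item on that list is the easiest to pin down. A sketch of the fully-qualified identity as a parsed, validated type; the grammar mirrors the org/pack/skill@version shape above and is an illustration, not any registry's actual spec:

```python
# Illustrative sketch: a fully-qualified skill identity the runtime resolves
# against, distinct from any display name. Parsing fails loudly on anything
# that is not a complete org/pack/skill@version id.
import re
from typing import NamedTuple

class SkillId(NamedTuple):
    org: str
    pack: str
    skill: str
    version: str

_QUALIFIED = re.compile(r"^([\w-]+)/([\w-]+)/([\w-]+)@(\d+\.\d+\.\d+)$")

def parse_skill_id(s: str) -> SkillId:
    m = _QUALIFIED.match(s)
    if not m:
        raise ValueError(f"not a fully-qualified skill id: {s!r}")
    return SkillId(*m.groups())
```

A bare display name like "search-orders" is rejected at parse time, which is the whole point: resolution never starts from an ambiguous name.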

None of this is exotic. Every item on that list has a working precedent in npm, cargo, pip, apt, or homebrew. The work isn't research. The work is taking these well-understood patterns and writing them down as a standard the major agent runtimes can implement against.

The architectural bet

Here's the part that might be controversial. The orchestration framework wars — LangGraph vs. CrewAI vs. AutoGen vs. Microsoft Agent Framework vs. whatever launched last week — are mostly arguing about the wrong layer.

A new orchestration topology gives you a different shape for sequencing agent calls. That's useful. But the bug that ate three weeks of refunds in my opening story didn't happen because the orchestrator had the wrong topology. It happened because the module layer underneath the orchestrator had no name resolution policy. Better topologies on top of broken module layers don't get you reproducibility, supply-chain safety, or composability.

The agent runtime needs a package-manager-grade module layer before it needs another orchestration framework. APM is the most credible attempt at this layer so far precisely because it studied npm and yarn and copied the parts that worked. Lockfiles, manifests, scoped namespaces, deterministic installs — these are not exciting. They are the load-bearing infrastructure that everything more interesting will eventually sit on top of.

If you're building agent platforms in 2026, the question worth asking before you ship another orchestration primitive is: can two different teams in your company install overlapping skill sets without a meeting to resolve naming? If the answer is no, you don't have a runtime yet. You have a demo with a registry attached. The fix isn't a new feature. It's an import system.
