
The Two-Language Problem: Why Type Safety Stops at the Prompt Boundary

10 min read
Tian Pan
Software Engineer

Your codebase has two languages, and only one of them has a compiler. There is the strictly-typed code your team writes — TypeScript with strict: true, Python with mypy in CI, Go with its enforced returns — and then there is the prompt: a templated string that gets concatenated, sent to a remote model, and returns another string the runtime hopes to parse. Between those two regions, the type system goes blind. The IDE highlights nothing. The compiler complains about nothing. And the team that ships a feature on the strength of "but it typechecks" has put the load-bearing contract somewhere the contract checker cannot see.

The seam is well-disguised. From the outside it looks like a function call: generate(input: UserQuery): Promise<AgentResponse>. The signature is honest about what flows in and what flows out. The dishonest part is what happens between the call site and the response: the input is interpolated into a prompt template that references field names by string, the model is asked to produce a JSON object that conforms to a schema described in prose inside that prompt, the response comes back as a string that gets handed to a parser, and the parser returns something the type system can finally see again. Every typed expression on either side is asserting things about a region in the middle that has no static guarantees at all.

This isn't a theoretical concern. Teams report a baseline 10–20% schema-failure rate on naive structured outputs in production, and the failures concentrate on exactly the inputs where you can least afford a silent drop — long contexts, deep tool chains, edge-case users. The type system gave a false sense of correctness right up to the moment the malformed JSON came back and the runtime swallowed it.

The Compiler's Last Mile

Static type systems work because they trace data flow through expressions whose shape they can see. They lose the trail at three places in any LLM application, and all three sit in the no-fly zone between the typed call site and the typed response object.

The interpolation seam. A prompt template is a string with named placeholders: "Given the user query {{query}} and their plan {{plan}}, decide if they qualify." The template engine fills the placeholders by string substitution. The type system sees the template as a string. It sees the inputs as { query: string; plan: string }. It does not see that renaming plan to tier everywhere in the codebase leaves the template asking for a field that no longer exists, because the template's reference to plan is a substring inside a string literal that no static analysis is going to walk.

The pattern that bites every TypeScript LLM codebase eventually: an engineer renames a domain field, the IDE confidently rewrites every typed reference, and the prompt template silently emits "Given the user query {{plan}} and their plan undefined" because the substitution failed and the template engine returned the placeholder unfilled. The model is now reasoning about an undefined plan. The eval is now contaminated. The TypeScript compiler exited 0.
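One way to pull the placeholders back into the compiler's sight is TypeScript's template literal types: if the template is a const string, its placeholder names can be recovered at the type level, so the rename-without-template-update bug becomes a compile error. A minimal sketch, with illustrative names:

```typescript
// Recover "{{name}}" placeholders from a const template string at the type level.
type Placeholders<S extends string> =
  S extends `${string}{{${infer P}}}${infer Rest}`
    ? P | Placeholders<Rest>
    : never;

const template =
  "Given the user query {{query}} and their plan {{plan}}, decide if they qualify." as const;

// Vars is { query: string; plan: string } — derived from the template itself.
type Vars = Record<Placeholders<typeof template>, string>;

function render(tpl: typeof template, vars: Vars): string {
  return tpl.replace(/\{\{(\w+)\}\}/g, (_, key: string) => vars[key as keyof Vars]);
}

// render(template, { query: "cancel", tier: "pro" })
// ^ compile error: 'tier' does not exist in Vars — the rename is caught statically.
```

This only works when the template is a literal the compiler can see; templates loaded from files or a database still need the runtime test and linter described later in the article.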

The schema-as-prose seam. Every structured-output call ships a JSON schema to the model. Most teams generate that schema from a Zod or Pydantic definition and then add a description field on each property to instruct the model. From the type system's perspective, the description is a string literal. From the model's perspective, the description IS the prompt for that field. Renaming a field in code without updating its description in code creates a quiet contradiction the model will eventually resolve in a way nobody predicted. The schema and the prose are versioned independently because the type system thinks they're the same thing.

The output-string seam. The model's response arrives as a string. Somewhere downstream, code does JSON.parse(response) and casts the result to the expected type. The cast is a lie the type system has agreed to participate in. The runtime parser may have produced an object missing a required field, or with a field whose value violates a constraint that exists only as a Zod refinement two layers up. Until the cast hits an actual property access that fails, the type checker is satisfied that the cast holds. Until.

What Goes Wrong Without the Disciplines

When the type-safe-on-the-edges, prompt-in-the-middle architecture ships without explicit discipline at the boundary, the failure modes follow a predictable taxonomy.

Silent schema drift. A team renames an enum value in code (tier: "free" | "pro" | "enterprise" becomes tier: "free" | "pro" | "business"), updates the validator, ships the change. The prompt template still refers to "enterprise" in its few-shot examples because the few-shot strings are hardcoded inline. The model continues emitting "enterprise" in its outputs for that intent. The validator now rejects 8% of outputs and the retry logic eats the cost. Nobody notices for two weeks because the dashboard reports "schema pass rate 92%" as a steady-state number, not a regression.

Tool-argument rot. A team adds a new field to a tool's parameter schema and updates the typed signature. The prompt template that names the tool's parameters in prose isn't updated because nobody realized prose was a parameter list. The model continues calling the tool with the old shape. The runtime coerces or drops the missing field silently. Engineering teams report that this class of failure — malformed JSON, missing fields, wrong types — produces more production failures than hallucinations. It's not a model problem. It's a contract-synchronization problem the type system was supposed to prevent and didn't.

Round-trip invisibility. A schema-valid response can be semantically empty. One sentiment classifier shipped where "every record was valid JSON, correct types, correct enums" — and confidence was 0.99 on every input including gibberish, because the model had collapsed to a constant output the schema happily accepted. The validator says "OK." The user-facing feature says "every input is positive sentiment with high confidence." The type system has no opinion either way: the bytes on the wire match the type definition exactly.

Description rot. The description on a schema field starts as accurate prose. Three product changes later, the description still says "the user's account tier (free, pro, or enterprise)" while the field now accepts "business." The model is reading stale instructions on every call. The type system can't see the contradiction because it's inside a string. Reviewers can't catch it in PR review because the diff looks tidy: only the enum literal changed.

The Disciplines That Close the Gap

The fix is not a smarter compiler — it's a set of contracts the compiler can't enforce but the team can. Three disciplines, run together, recover most of what the type system used to give you.

Schema as the single source of truth. Define the shape once, in the typed language, and generate everything downstream from that definition. The JSON schema sent to the model, the prose description embedded in the prompt, the runtime parser, the type the rest of the code consumes — all four derive from one Zod or Pydantic object. Tools like BAML take this further by making the schema a separate DSL that compiles to typed clients in multiple languages, but you don't need a new DSL to start; you need a build step that catches the moment a description string and a field name disagree.

The simplest version: one file per agent, exporting both the Zod schema and the prompt template, with a unit test that interpolates every placeholder against a sample input and asserts the result contains no literal {{ or undefined substrings. That test is the closest thing to a compiler the prompt boundary will get, and it catches the rename-without-template-update bug on the PR that introduces it.
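The test described above is a few lines. A sketch, mimicking a template engine that leaves missing substitutions as the string "undefined" (names are illustrative):

```typescript
const template =
  "Given the user query {{query}} and their plan {{plan}}, decide if they qualify.";

// Simulates a template engine that substitutes "undefined" for missing fields.
function render(tpl: string, vars: Record<string, string>): string {
  return tpl.replace(/\{\{(\w+)\}\}/g, (_, key: string) => vars[key] ?? "undefined");
}

// The unit test: interpolate a sample input and fail on any unfilled placeholder.
function assertFullyRendered(tpl: string, sample: Record<string, string>): void {
  const out = render(tpl, sample);
  if (out.includes("{{") || out.includes("undefined")) {
    throw new Error(`template has an unfilled or missing placeholder: ${out}`);
  }
}
```

Run against a sample input on every PR, this fails the build on the exact commit that renames a field without updating the template.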

Round-trip eval as a CI gate. The standard eval pattern grades the model's output for quality. The round-trip eval grades the contract: take a sampled input, run the full pipeline, parse the output, and assert that the parsed object satisfies not just the schema but the semantic invariants the schema can't express — distributions vary across inputs, confidence isn't a constant, evidence fields actually appear in the input verbatim, enum values don't drift toward extinct categories. Run this as a CI gate that fires on every prompt change, every schema change, and every model upgrade.

The eval set has to be diverse enough to surface mode collapse, so include adversarial inputs whose only purpose is to detect the model has stopped distinguishing. A cheap version: assert that a sample of 200 production inputs produces at least three distinct outputs across each enum field. Failure means something is collapsing — a stale prompt, a coerced default, a model regression — and the gate fires before users see it.
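The cheap version of that gate fits in one function. A sketch, where the three-distinct-values threshold and the field names are assumptions to tune per application, not a standard:

```typescript
// Mode-collapse gate: over a sample of parsed outputs, assert each enum field
// takes at least `minDistinct` distinct values. A collapse to one or two
// values fires the gate before users see constant outputs.
function assertNoCollapse(
  outputs: Array<Record<string, string>>,
  enumFields: string[],
  minDistinct = 3,
): void {
  for (const field of enumFields) {
    const distinct = new Set(outputs.map((o) => o[field]));
    if (distinct.size < minDistinct) {
      throw new Error(
        `possible mode collapse: field "${field}" took only ${distinct.size} ` +
          `value(s) across ${outputs.length} outputs`,
      );
    }
  }
}
```

The sentiment classifier from the earlier section, emitting constant 0.99-confidence outputs, would have tripped this check on the first CI run after the regression.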

A prompt linter for the boundary rules. The seam rules that the type system can't enforce can still be checked by a static linter that knows what a prompt template is. The minimum rule set: every placeholder in a template must correspond to a field on the typed input object; every enum value mentioned in a few-shot example must be present in the schema; every field referenced in the description prose must exist in the schema; ordering invariants like "guardrails before persona before tools" hold across edits.

Building this linter is a weekend of work in any language; the value is that it runs in pre-commit and on every PR, surfacing the boundary violations the language compiler is structurally blind to. The investment is small. The cost of not having it is one production incident plus the postmortem that says "we should have checked this in CI."

The Architectural Realization

The type system stopping at the prompt is not a defect in TypeScript or Python. It's a property of the architecture: an LLM is a remote process that consumes natural language and emits text, and there is no static guarantee the type checker can extract from a remote process whose output distribution is sampled from a probability model. What you can have is a contract enforced at every boundary the bytes cross — the input boundary where the template gets filled, the schema boundary where the prose description leaves the codebase, and the output boundary where the response gets parsed back.

The teams that build production-grade AI systems treat the prompt as a typed artifact even when the language doesn't help them. They keep the schema, the template, and the prose description in one file, generated from one source. They run a round-trip eval that grades the contract, not just the quality. They write a linter that checks the rules the compiler can't see. None of those disciplines is glamorous, and none of them shows up on a benchmark. They show up in the absence of incidents.

The two-language problem isn't going away — the LLM is, definitionally, a different language than the code that calls it. What changes is whether your codebase pretends the seam doesn't exist or treats it as a first-class engineering surface with its own tooling, its own tests, and its own owners. The teams in the first category will keep being surprised by silent failures their type checker should have caught and didn't. The teams in the second category will discover that the prompt boundary, properly instrumented, is just another contract — strange and remote, but no less enforceable than the contracts that already work.
