
Communicating AI Limitations Across the Organization: A Framework for Engineering Leaders

11 min read
Tian Pan
Software Engineer

The demo worked perfectly. Legal had signed off. Sales was already promising customers the feature would ship next quarter. Then the first production failure happened — the model confidently drafted a clause that cited a contract term that didn't exist, sales forwarded it to a customer, and legal spent three weeks in damage control.

This is not a story about a bad model. It's a story about miscommunication. The engineering team knew the model could hallucinate. Legal assumed it wouldn't. Sales assumed any failure would be caught before reaching customers. Ops assumed someone else was monitoring for exactly this. Nobody was lying. Everyone was working from a different mental model of the same system.

The root cause of most AI project failures isn't the AI. According to RAND Corporation's analysis of failed AI initiatives, "misunderstood problem definition" — which includes miscommunication about capability limits — is the single most common cause. Between 70 and 95% of enterprise AI initiatives fail to deliver their intended outcomes, and the technology is rarely the limiting factor. The limiting factor is that every team in your organization is quietly building a different theory of what your AI system does, and nobody has explicitly corrected any of them.

The Three Teams That Will Misunderstand Your System

The miscommunication isn't random. Each non-engineering team tends to import the same wrong mental model, derived from the tools they already know.

Legal teams assume determinism. Legal professionals spend their careers in formal systems where the same inputs always produce the same outputs. A contract clause means what it says. A statute applies consistently. When they encounter an LLM, they unconsciously apply the same expectation.

If they've reviewed an output once and it looked fine, they assume future outputs in the same category will also look fine. They don't internalize that the system can produce opposite conclusions on the same hard question depending on phrasing, message order, or even which data center handled the request. Setting temperature to zero helps — but does not eliminate variance. That fact surprises almost every legal team when they first hear it.
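The determinism claim is easy to test empirically rather than argue about. The sketch below is a minimal harness that calls any text-generation function repeatedly on the same prompt and counts distinct outputs; `generate` is a placeholder for whatever client call your system uses (an OpenAI or Anthropic request with `temperature=0`, for example). A truly deterministic system always yields one distinct output; LLM APIs often do not.

```python
def output_variance(generate, prompt, n=5):
    """Call a text-generation function n times on the same prompt and
    count how many distinct outputs it produced.

    `generate` is any callable taking a prompt string and returning a
    string -- substitute your own model client here. A deterministic
    system returns 1; hosted LLMs at temperature 0 frequently return
    more, which is exactly the fact that surprises legal teams.
    """
    outputs = [generate(prompt) for _ in range(n)]
    return len(set(outputs))
```

Running this against your actual endpoint, and showing legal the result, is far more persuasive than asserting "the system is probabilistic" in a document.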

The practical consequence: legal signs off on a narrow set of outputs as acceptable, then discovers in production that the system isn't constrained to those outputs. When something falls outside the approved envelope, their instinct is to treat it as a bug rather than an inherent property of the system.

Sales teams oversell reliability. Sales teams are optimized to close, and AI demos are extremely demo-able. A well-curated demo showing the system performing confidently on a representative sample creates a memorable impression. Sales then faithfully reports what they saw — to customers, to partners, to executives. They're not deliberately misleading anyone. They genuinely haven't been shown the 13% of cases where the system fails, because the engineering team showed them the system at its best. The result is that customer-facing promises are built on a cherry-picked sample of behavior, and the first edge case in production becomes a customer complaint rather than an expected occurrence that was communicated upfront.

Beyond the demo problem, sales often doesn't understand the data-quality prerequisites that enable the AI feature. A system that generates personalized outreach based on account data only works as promised when the account data is current and complete. Sales teams frequently discover this relationship after a high-profile failure with a strategic account.

Ops assumes zero maintenance burden. Operations teams are accustomed to software that works consistently until something breaks, at which point an engineer fixes it. AI systems have a different failure mode: silent degradation.

A model's behavior can shift over time as upstream data changes, as the provider silently updates the base model, or as real user inputs drift from the distribution the system was designed for. This degradation doesn't trigger an error page or a pager alert. It shows up as a gradual decline in output quality that ops has no framework for detecting.
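Because silent degradation never trips an error page, detecting it means watching distributions, not exceptions. As an illustrative sketch (not a production monitor), the function below flags when the mean length of recent requests drifts far from a baseline window, measured in baseline standard deviations; real monitoring would track richer features (topic mix, locale, output quality scores), but the shape of the check is the same.

```python
from statistics import mean, stdev

def drift_alert(baseline_lengths, recent_lengths, z_threshold=3.0):
    """Flag when recent input lengths drift far from the baseline.

    A crude proxy for distribution drift: compare the mean of recent
    request lengths against the baseline mean, in units of the
    baseline standard deviation. The threshold of 3.0 is an arbitrary
    starting point to tune against your own traffic.
    """
    mu = mean(baseline_lengths)
    sigma = stdev(baseline_lengths)
    if sigma == 0:
        return False
    z = abs(mean(recent_lengths) - mu) / sigma
    return z > z_threshold
```

The point of even a crude check like this is that it fires on *change*, which is the failure mode ops teams have no existing framework for.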

Operations teams also consistently underestimate the ongoing maintenance burden. They plan for a feature launch, not for the recurring cycle of monitoring, prompt adjustments, and eval reruns that keep an AI feature delivering consistent value.

Why Standard Project Kickoffs Don't Fix This

The typical response to stakeholder alignment challenges is to add more kickoff meetings, more documentation, and more Confluence pages. This doesn't work for AI capability communication, for two reasons.

First, the failure modes of AI systems are largely unintuitive. You can explain in a document that the model "may occasionally produce incorrect output," and everyone will nod and move on, having mapped that statement to their mental model of software bugs — rare, fixable, detectable. The actual failure mode is different: frequent, distributed, sometimes undetectable without domain expertise, and not fixable by restarting the service.

Second, abstract capability statements don't change behavior. Telling legal that the system is "probabilistic not deterministic" doesn't change how they review outputs until they've seen what that means in practice. Telling ops that the system "requires ongoing monitoring" doesn't change their staffing model until they've experienced a degradation incident.

What actually works is a two-part approach: structured capability briefs that make the limits concrete, and calibration demos that make the failure modes visible.

The Capability-and-Limitation Brief

Before any AI feature ships, each stakeholder team should receive a one-page brief that answers five specific questions:

What does this system do reliably? State a specific success rate with a definition. "This system drafts customer email responses with an 88% acceptance rate, where acceptance means a sales rep sends the draft with fewer than 10 words changed." Vague claims like "helps with email" or "improves response time" give stakeholders no basis for evaluating whether the system is performing as expected in production.
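A success-rate definition like the one above is only useful if it is mechanically checkable. A minimal sketch of the "fewer than 10 words changed" acceptance metric, using a word-level diff (the exact pairing of drafts to sent messages is assumed to come from your own logging):

```python
import difflib

def words_changed(draft: str, sent: str) -> int:
    """Count words inserted, deleted, or replaced relative to the draft."""
    d, s = draft.split(), sent.split()
    matcher = difflib.SequenceMatcher(a=d, b=s)
    changed = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed

def acceptance_rate(pairs, max_changed=10):
    """Fraction of (draft, sent) pairs where the rep changed fewer
    than `max_changed` words -- the brief's definition of acceptance."""
    accepted = sum(1 for d, s in pairs if words_changed(d, s) < max_changed)
    return accepted / len(pairs)
```

Once the metric is code rather than prose, the number in the capability brief and the number on the dashboard are guaranteed to mean the same thing.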

What does this system fail on? Name the specific failure modes, not as a legal disclaimer but as operational guidance. "The system frequently invents product features when asked about capabilities outside the current catalog. It sometimes produces outputs in the wrong language when the user's account locale doesn't match their message language. It degrades significantly on requests longer than 800 words." This level of specificity is what enables teams to build appropriate workflow gates around the system.

What does a failure look like? Include an example of a real failure from testing. Not a theoretical one — an actual output the system produced that was wrong. This is the part most engineering teams skip because it feels like advertising the product's weaknesses. It is. Do it anyway. Legal and compliance teams will build their review processes around concrete failure examples in a way they never will around abstract risk descriptions.

What are the prerequisites? State explicitly what conditions the system requires to function at its stated reliability. "The system requires complete account data: missing industry classification reduces acceptance rate to 62%." Sales and ops need these prerequisites to manage expectations and configure their workflows correctly.

How will you know when it's failing? Give each stakeholder team one or two metrics to watch that will surface degradation before it becomes a crisis. Make these observable in dashboards they already use, not new tooling they have to remember to check.

Calibration Demos Are Not Regular Demos

The standard AI demo is a liability. It's curated to show the system at its best, on clean data, in scenarios the team has rehearsed. Every attendee leaves with an impression of reliability that won't survive first contact with production.

A calibration demo is structured differently. It starts by showing the success case, then deliberately shows the failure cases. "Here's what a good output looks like. Now here's what the system does with an account that has stale industry data. Here's what it does when the request is ambiguous. Here's what happens when the system's context window is nearly full and it starts truncating earlier parts of the conversation."

The quantification matters. "This system is right 87% of the time. Here's the 13%." That sentence lands differently than any amount of documentation about probabilistic outputs. And the stakeholder's reaction to seeing the failure cases tells you whether you've achieved alignment. If legal's reaction is "that's fine, we can build a review gate for those cases," you've succeeded. If their reaction is "we can't ship something that does that," you've found a blocker before it became a production incident.

During calibration demos, also show the operational picture: the monitoring dashboard, the alerting thresholds, the process for reporting a degradation. Ops teams need to see that there's a system for detecting and responding to failures, not just a feature that's been handed to them.

The Operational Runbook for Non-Technical Teams

The system documentation that engineering teams produce is usually written for engineers. A production runbook for an AI feature needs a separate, parallel version written for the teams operating around it.

This runbook should define three things:

Scope boundaries. A clear list of what the system should and should not be used for. Written in the language of the team using it, not technical jargon. "Use this system for: initial outreach drafts, bulk content generation, routine support responses. Do not use this system for: legal commitments, statements about product pricing or availability, communications to named enterprise accounts without human review." The system cannot enforce these boundaries; the humans operating around it need to.

Escalation triggers. Specific observable conditions that should trigger a report to engineering. "If you notice the system using a product name that doesn't appear in the product catalog — report it. If the acceptance rate on a team's drafts drops below 70% in a week — report it. If the system produces the same error more than three times in a day — report it." The goal is to build a feedback channel from operators to engineering that doesn't depend on operators understanding the underlying model. They need to know what to look for, not why it happens.
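Escalation triggers of this kind can be encoded directly, so the runbook and the alerting logic cannot drift apart. A sketch assuming three observable inputs mirroring the examples above (the thresholds are the text's examples, not recommendations; tune them to your own capability brief):

```python
def escalation_triggers(weekly_acceptance_rate, unknown_product_names,
                        repeated_error_count):
    """Evaluate the runbook's observable escalation conditions and
    return a list of human-readable reports to send to engineering.

    Inputs are assumed to come from existing dashboards: the week's
    draft acceptance rate, any product names in outputs that are not
    in the catalog, and today's count of a repeated error.
    """
    reports = []
    if unknown_product_names:
        reports.append(
            f"unknown product names in output: {sorted(unknown_product_names)}")
    if weekly_acceptance_rate < 0.70:
        reports.append(
            f"weekly acceptance rate {weekly_acceptance_rate:.0%} is below 70%")
    if repeated_error_count > 3:
        reports.append(
            f"same error seen {repeated_error_count} times today")
    return reports
```

Whether the check runs as code or as a checklist taped to a monitor matters less than the property it preserves: operators report on what they can observe, without needing to understand why the model misbehaves.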

The failure normalization statement. Explicitly state that failures are expected, that the system won't be right 100% of the time, and that the process for handling failures is defined. This sounds obvious but is one of the most important components. Without an explicit normalization of failure, the first significant incident creates a disproportionate loss of confidence in the entire system, and teams either abandon the feature or escalate to executive leadership with a narrative that can't be undone.

The Communication Cadence

Capability briefs and runbooks are launch artifacts. Maintaining aligned expectations in production requires an ongoing communication structure.

Weekly cross-functional syncs between engineering and the primary operating team should include a brief review of the metrics from the capability brief — not because something is wrong, but to establish a shared sense of what normal looks like. When the system does degrade, a team that has been watching metrics weekly can distinguish "within normal variance" from "something changed" in a way that a team seeing the numbers for the first time cannot.

Monthly, each stakeholder team should receive an update that covers: current performance against the stated success rate, any changes to the system that could affect behavior, and any new failure modes identified since launch. This shouldn't be a lengthy report — it should be a short, structured update that stakeholders can read in two minutes. The goal is to prevent the expectation drift that happens when stakeholders update their mental model based on recent experience without any input from engineering.

The quarterly cadence should include a reassessment of the capability brief itself. Has the system's actual success rate changed? Have new failure modes been identified? Have the prerequisites changed? The brief is not a launch document — it's a living description of the system, and it should reflect the system as it exists today.

What This Actually Changes

Engineering leaders who build these communication structures consistently report two outcomes. First, the first production failure is handled as an expected operational event rather than a crisis, because every stakeholder team already had a model for what failure looked like and a process for responding to it. Second, expansion of AI features within the organization moves faster, because the track record of honest capability communication — including the failures — builds a form of trust that oversold, under-delivered pilots cannot.

The teams that win with AI internally are not the ones with the most impressive demos. They're the ones who told every stakeholder team exactly what the system would do wrong, gave them a process for handling it, and then delivered what they promised. That's a communication discipline problem, not a technology problem, and it's one that engineering leaders can solve without waiting for the models to improve.

Companies with AI-ready leadership — those who align organizational expectations with technical reality — are measurably more likely to capture value from AI investments. The bottleneck, consistently, is not the model. It's the conversation around the model.
