
Why AI-Generated Terraform and Kubernetes Configs Are Silently Wrong

Tian Pan · Software Engineer · 11 min read

Most platform engineers have a version of the same story: they asked an AI assistant to scaffold a Terraform module or a Kubernetes deployment manifest, it came back looking completely reasonable, the CI pipeline went green, and weeks later something bad happened. An IAM role with wildcard permissions. An S3 bucket that wasn't supposed to be public. A Kubernetes pod running as root because nobody checked the security context.

The core problem isn't that LLMs write bad syntax — they rarely do. The problem is that IaC correctness has almost nothing to do with syntax. A Terraform file that terraform validate accepts can still deploy a security disaster. A Kubernetes manifest that kubectl apply --dry-run=client accepts can still schedule pods with dangerous capabilities. The tools your CI pipeline uses to check the code are mostly checking the wrong things.

This post is a field guide to the specific ways LLM-generated IaC fails, why those failures survive the checks teams already have in place, and what actually catches them.

The Accuracy Numbers Are Bad

Before getting into the taxonomy, it helps to understand the baseline. The IaC-Eval benchmark — 458 human-curated AWS scenarios, published at NeurIPS 2024 — found that the best-performing model at the time achieved 19.36% pass@1 accuracy on realistic IaC tasks. The deployability-centric DPIaC-Eval benchmark, which measures whether code actually deploys correctly rather than merely passing syntax validation, put the state of the art across six major LLMs at 20–30% first-attempt success.

These aren't fringe models. These are the best available tools evaluated on realistic cloud infrastructure tasks. The gap between "syntactically valid" and "actually works and is secure" is enormous.

Teams that don't know these numbers tend to calibrate their review processes around the assumption that AI-generated code is mostly right and needs light checking. The actual calibration should be: AI-generated IaC code is frequently wrong in non-obvious ways, and your reviewers need to know what to look for.

Failure Mode 1: Hallucinated Dependencies and Phantom Attributes

The most common LLM IaC failure is referencing outputs, attributes, or resource types that don't exist — or exist differently than the model thinks.

Common examples:

  • A Lambda function referencing an S3 bucket ARN that isn't exported from the module creating the bucket
  • An IAM policy referencing an aws_s3_bucket attribute (bucket_regional_domain_name, for example) whose name or behavior changed in a provider version update
  • A Terraform module using count everywhere instead of for_each, which breaks resource addressing when items are added or removed mid-list
  • A moved block missing entirely during a refactor, so Terraform destroys and recreates resources instead of renaming them

The last two are subtle. count-versus-for_each isn't a hallucination — it's a pattern the model learned from older examples that was valid but is now considered a footgun. The missing moved block means a rename becomes a destroy-and-recreate, which in production means downtime and data risk.
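A sketch of the safer pattern, with hypothetical resource names: key the collection with for_each so addresses stay stable, and pair any refactor with a moved block.

```hcl
# for_each keys each bucket by name, so removing "logs" later doesn't
# shift the addresses of the others the way a count index would.
resource "aws_s3_bucket" "this" {
  for_each = toset(["assets", "logs"])
  bucket   = "example-${each.key}"
}

# During a refactor, a moved block makes a rename a rename instead of
# a destroy-and-recreate (requires Terraform 1.1+).
moved {
  from = aws_s3_bucket.data
  to   = aws_s3_bucket.this["assets"]
}
```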

These failures don't show up in terraform validate because that command only checks syntax and module structure. They don't show up in terraform plan either — plan will tell you it's going to destroy that database instance, but it won't tell you that was an accident.

Failure Mode 2: Security Misconfigurations That Pass Plan

This category is the most dangerous because everything appears correct at every automation checkpoint.

The security holes that LLM-generated Terraform regularly introduces:

Overly broad IAM policies. Models default to Action: "*" or Resource: "*" more often than they should, especially when the user's prompt is ambiguous about which resources a role needs to access. Least-privilege IAM requires knowing the exact list of required actions and resource ARNs — something the model infers generically rather than computing from actual usage.
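A minimal contrast, using a hypothetical queue-consumer role: the first policy is the shape models tend to emit when the prompt is vague; the second is what least privilege actually looks like.

```hcl
# What models tend to emit: syntactically valid, passes plan,
# and grants every action on every resource.
data "aws_iam_policy_document" "too_broad" {
  statement {
    actions   = ["*"]
    resources = ["*"]
  }
}

# Least privilege: the exact actions this consumer needs, scoped to
# the exact queue ARN (aws_sqs_queue.jobs is assumed to exist).
data "aws_iam_policy_document" "scoped" {
  statement {
    actions   = ["sqs:ReceiveMessage", "sqs:DeleteMessage"]
    resources = [aws_sqs_queue.jobs.arn]
  }
}
```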

Public S3 buckets. If the prompt mentions that something needs to be accessible, the model may add public read ACLs or forget to include aws_s3_bucket_public_access_block. This passes plan because public access is a valid configuration.
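The mitigation is a single resource that models routinely omit. A sketch, assuming the bucket is defined elsewhere as aws_s3_bucket.this:

```hcl
# Blocks all four public-access vectors; a safe default for any bucket
# that isn't deliberately serving a public website.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```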

Hardcoded secrets in HCL. Database passwords and API keys end up as literal strings in .tf files, which then get committed to version control and written to Terraform state files. Both leak credentials persistently.
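A safer sketch: mark inputs sensitive, and prefer resolving secrets at apply time rather than committing them (the secret name below is hypothetical).

```hcl
# Marking the variable sensitive redacts it from CLI output, but the
# value still lands in state: the state backend must itself be
# encrypted and access-controlled.
variable "db_password" {
  type      = string
  sensitive = true
}

# Better: resolve the secret at apply time instead of committing it.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/db/password" # hypothetical secret name
}
```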

Unencrypted storage. EBS volumes, RDS instances, and S3 buckets created without encrypted = true or equivalent. Cloud provider defaults often leave encryption off — the model generates what it's seen in training examples, which skews toward whatever was common when the examples were written.
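The fix is usually one attribute per resource type. A sketch with hypothetical sizing:

```hcl
# EBS encryption is per-volume unless account-level default encryption
# is enabled. RDS has the equivalent storage_encrypted argument, and
# S3 buckets take an aws_s3_bucket_server_side_encryption_configuration
# resource.
resource "aws_ebs_volume" "data" {
  availability_zone = "us-east-1a" # hypothetical AZ
  size              = 100          # hypothetical size, GiB
  encrypted         = true
}
```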

Security groups with 0.0.0.0/0 on sensitive ports. Port 22 open to the internet is a consistent pattern in models trained on tutorial code. It's technically valid infrastructure that a security scanner should catch.
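A hedged sketch of the scoped version; the CIDR is a hypothetical VPN range, and the referenced security group is assumed to exist elsewhere:

```hcl
# The tutorial pattern (cidr_blocks = ["0.0.0.0/0"] on port 22) is
# valid, deployable, and wrong. Scope ingress to a known range instead.
resource "aws_security_group_rule" "ssh" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["10.20.0.0/16"]          # hypothetical VPN range
  security_group_id = aws_security_group.app.id # assumed defined elsewhere
}
```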

The common thread: terraform plan reports these as successful, non-erroring resource configurations. The misconfiguration is semantic, not syntactic.

Failure Mode 3: Outdated Provider Patterns

Terraform provider APIs change. AWS releases breaking updates. Kubernetes API versions deprecate. The model's training data represents a snapshot of documentation and community examples from some point in the past, and that snapshot doesn't stay current.

This surfaces as:

  • Terraform AWS provider v4 attributes that were renamed or removed in v5 (the v4-to-v5 migration had significant breaking changes around S3, DynamoDB, and EC2 resources)
  • Kubernetes manifests using apps/v1beta1 (removed in Kubernetes 1.16) instead of apps/v1
  • CDK constructs that reference L2 patterns that were refactored in later versions
  • CloudFormation resource properties that changed schema between when the training examples were written and now

The operational effect: the code deploys successfully in one environment running an older provider version, then fails or behaves differently in another. Or it works today and fails after a provider upgrade that nobody realized would affect existing code.
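One concrete instance is the S3 configuration refactor, which began in AWS provider v4 and moved inline blocks like versioning into standalone resources. A sketch of the old pattern next to the current one:

```hcl
# Older pattern, still common in training data (deprecated since
# provider v4):
#
#   resource "aws_s3_bucket" "b" {
#     versioning { enabled = true }
#   }

# Current pattern: bucket configuration lives in standalone resources.
resource "aws_s3_bucket_versioning" "b" {
  bucket = aws_s3_bucket.b.id
  versioning_configuration {
    status = "Enabled"
  }
}
```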

Failure Mode 4: Kubernetes-Specific RBAC and Security Context Holes

Kubernetes manifests have a distinct failure surface. Research on LLM-generated Kubernetes YAML found that 35.8% of manifests included quality issues, and data from Kubernetes security scanners shows how common the same problems are across open-source manifests generally.

The categories that recur most often in AI-generated configs:

Missing resource limits. Containers without resources.limits set can consume unbounded CPU and memory, causing noisy-neighbor effects and, in the worst case, node pressure that evicts other workloads. Approximately 58% of open-source Kubernetes manifests lack resource limits — the base rate is high, and AI-generated code doesn't improve it.

Insecure security context defaults. Containers running as root, privileged: true when not required, missing allowPrivilegeEscalation: false, and missing readOnlyRootFilesystem are all patterns that appear in training examples and get reproduced without modification. The model generates what it learned, and what it learned often wasn't hardened.
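Here is a container spec carrying the fields from the last two categories (limits plus a hardened security context), with a hypothetical image and workload name:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3 # hypothetical image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:          # the field most open-source manifests omit
              cpu: 500m
              memory: 256Mi
          securityContext:   # the hardening defaults models skip
            runAsNonRoot: true
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
```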

Over-permissive RBAC. ClusterRole bindings for workloads that only need namespace-scoped access. cluster-admin bindings for service accounts that need to list pods. The model's default mental model for "give this workload API access" skews too broad.
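What right-sized access looks like for the list-pods case above: a namespace-scoped Role and RoleBinding instead of any ClusterRole. Names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: web             # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: web
subjects:
  - kind: ServiceAccount
    name: web-sa             # hypothetical service account
    namespace: web
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```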

Deprecated API versions. Kubernetes has a strict deprecation cycle that the model's training data may not reflect. A manifest that works against one cluster version fails against another when the API group is removed.

Missing liveness and readiness probes. Without probes, Kubernetes has no signal for whether a pod is healthy. Traffic routes to pods that are running but not serving. Deployments complete even when the new pods are broken.
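A minimal probe pair, as a fragment of a container spec; paths, port, and timings are hypothetical starting points:

```yaml
# Without these, a Running-but-broken pod keeps receiving traffic.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
```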

Why Your Existing Checks Miss Most of This

Teams that use AI-generated IaC typically already run terraform validate, terraform plan, and some linting. That catches less than people think:

  • terraform validate checks syntax and module structure. It doesn't check IAM semantics, resource attribute validity against the remote API, or security posture.
  • terraform plan shows what will be created, modified, or destroyed. It doesn't evaluate whether those resources are correctly configured for security. An IAM policy with "Action": "*" shows up in the plan as a valid resource — it's a plan-time success and a security failure simultaneously.
  • kubectl apply --dry-run=client validates against the local schema. It doesn't check API version compatibility against the actual cluster server version, resource limits, security context settings, or RBAC scope.

The gap between what these tools validate and what correctness actually requires is where AI-generated IaC failures live.

What Actually Catches These Failures

The toolchain that covers the gaps:

Static analysis with security semantics. Checkov and tfsec/Trivy analyze Terraform and Kubernetes configs against security rule libraries — 2,000+ built-in policies for Checkov. They catch the IAM wildcards, the public S3 configs, the unencrypted storage, and the missing Kubernetes security contexts that terraform plan ignores. These should run in CI on every PR, not optionally.

Policy-as-code with hard enforcement gates. OPA (Open Policy Agent) with conftest lets teams encode organization-specific security requirements as Rego policies evaluated against terraform plan JSON output. HashiCorp Sentinel serves the same purpose in the HashiCorp ecosystem. The critical detail: policies should have a hard-mandatory mode that blocks apply, not advisory mode that reports warnings nobody reads.
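A sketch of such a policy: a conftest-style Rego rule evaluated against terraform show -json output, catching the string form of an IAM action wildcard. The package name and pipeline commands follow conftest conventions.

```rego
package main

import rego.v1

# Assumes: terraform show -json tfplan > tfplan.json
#          conftest test tfplan.json
# This catches Action as a bare "*" string; a production policy would
# also handle the list form and Resource wildcards.
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_iam_policy"
  doc := json.unmarshal(rc.change.after.policy)
  stmt := doc.Statement[_]
  stmt.Action == "*"
  msg := sprintf("%s grants Action \"*\"; scope it to specific actions", [rc.address])
}
```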

Kubernetes-specific linters. kube-linter and kube-score perform static analysis on Kubernetes YAML, flagging missing resource limits, insecure security contexts, missing probes, and RBAC over-permissions. kubectl apply --dry-run=server (note: server-side, not client-side) validates against the actual cluster API and catches deprecated API versions.

Provider version pinning and drift detection. Pinning Terraform provider versions in required_providers prevents the training-data-version mismatch from causing silent behavior changes. Drift detection tools that continuously compare declared state to actual state catch when deployed infrastructure has diverged from what the code says.
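A minimal pinning sketch; version numbers are illustrative:

```hcl
terraform {
  required_version = ">= 1.7.0" # hypothetical floor
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Pin to a minor series so upgrades are deliberate, not incidental.
      version = "~> 5.40"
    }
  }
}
```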

Mandatory plan review with a security lens. Automated gates catch the known pattern classes. They don't catch semantic errors that are valid configurations but wrong for the specific context — a security group that correctly allows port 443 inbound but also leaves port 22 open because someone forgot to remove it. Human review of the plan diff needs to include explicit IAM, network, and encryption spot-checks, not just structural validation that the right resource types are being created.

The Comprehension Gap Problem

There's a second-order risk that's harder to gate against: teams that delegate IaC authoring to AI assistants build velocity but lose their mental model of the infrastructure.

Every accepted suggestion that you didn't fully trace through is a gap in your understanding of what's deployed. The gaps accumulate. An engineer who writes Terraform line by line develops intuitions about what each resource does and how they relate. An engineer who reviews AI-generated modules at a structural level doesn't develop the same intuitions at the same rate.

This matters when things go wrong. Incidents require people who understand the infrastructure well enough to reason about failure modes quickly. A team that has offloaded IaC authoring to AI assistants without compensating with deliberate review practices will find, at the worst time, that nobody has a clear picture of what's running.

The fix isn't to stop using AI assistance — it's to be deliberate about where the comprehension gap is. Require that reviewers can explain each resource's security posture before approving. Run periodic architecture reviews of AI-generated modules. Don't let the mental model rot.

What Changes in Your Review Process

The practical adjustments that follow from this:

  • Run Checkov and kube-linter in CI as blocking checks. Not advisory. Treat a failing security scan the same way you treat a failing test.
  • Add OPA or Sentinel policies for your organization's non-negotiable security requirements — IAM wildcards prohibited, encryption required, specific CIDR restrictions. Make these hard-mandatory.
  • Use --dry-run=server for Kubernetes manifests, not client-side dry-run, so deprecated API versions are caught against your actual cluster version.
  • When reviewing AI-generated IaC, explicitly check: IAM permissions (are these least-privilege?), network exposure (what's open to the internet?), encryption settings, and hardcoded secrets. These are the categories LLMs most reliably get wrong.
  • Don't treat a green terraform plan as validation that the configuration is correct. Plan is necessary, not sufficient.

Conclusion

LLM-generated IaC is genuinely useful for scaffolding and boilerplate. The failure to understand what it gets wrong — and what your existing checks don't catch — is what turns that utility into production incidents.

The failure modes are predictable: hallucinated dependencies, security misconfigurations that pass plan, outdated provider patterns, and Kubernetes-specific holes around RBAC and security context. The mitigations are also well-understood: static analysis with security semantics, policy-as-code with hard enforcement, server-side dry-runs, and review processes that explicitly look for the categories that automation misses.

What's less understood is the comprehension gap — the erosion of your team's mental model of the infrastructure when authoring is delegated too completely. That's a slower problem, and a harder one to gate against, but it's where the long-run risk concentrates.
