AI Has This 'Overwhelming Tendency' to Ignore Our Repo Conventions—How Are You Enforcing Consistency at Scale?

Three weeks ago, I reviewed a PR where our AI coding assistant generated a beautiful, working feature… in completely the wrong architectural style. The code worked. Tests passed. But it ignored every convention we’d built over two years of microservices evolution.

The problem is getting real. 93% of developers are using AI coding tools, and 41% of all code in 2026 is AI-generated. That’s not a pilot anymore—that’s production at scale. But here’s what nobody’s talking about: AI has an overwhelming tendency to ignore the conventions that already exist within a repository.

What We’re Seeing in Our Financial Services Codebase

I lead a team of 40+ engineers maintaining critical banking infrastructure. Here’s what architectural drift looks like in practice:

Naming inconsistency: AI toggles between camelCase and snake_case based on… vibes? One service uses getUserBalance, the next uses get_account_balance. Grep becomes useless. Refactoring becomes dangerous.

Architecture bypass: We have battle-tested middleware for transaction logging, rate limiting, and compliance auditing. AI-generated services wire around it because the agent doesn’t understand why those layers exist. Now we have compliance gaps that only surface during audits.

Documentation drift: Our team wiki says “all REST endpoints must include OpenAPI specs.” AI generates perfect implementations with zero documentation. Six months later, nobody remembers what those endpoints do.

The statistics validate what we’re experiencing: 96% of developers don’t fully trust the functional accuracy of AI-generated code. We’re all working with tools we fundamentally don’t trust to follow our rules.

Why Documentation Alone Doesn’t Work

We tried the standard approach: CODING_STANDARDS.md, ARCHITECTURE.md, AGENTS.md explaining our patterns with examples. The AI reads them. Sometimes follows them. Often… invents its own interpretation.

Here’s the core issue: Documentation files are fundamentally suggestions, not guarantees. AI can choose to ignore documentation, but it cannot ignore linting errors in your CI pipeline.

As Factory AI puts it: “Linters turn human intent into machine-enforced guarantees that allow agents to plan, generate, and self-correct without waiting on humans.”

The Question I’m Wrestling With

We’re at an inflection point. In our organization, AI is writing nearly half our code. But if we don’t solve architectural drift now, we’re building a maintenance crisis for 2028.

I want to hear from this community:

What’s actually working for you?

  • Are you using enhanced linting strategies that catch semantic issues?
  • Have you found CI/CD automation that enforces conventions before merge?
  • Are you doing context engineering (structured files that shape AI understanding)?
  • Have you tried tools like Drift that learn your patterns?

What’s the right balance?

  • How strict do you get without killing AI’s velocity benefits?
  • Do you block PRs automatically or flag for human review?
  • How do you handle the cases where AI’s “wrong” approach is actually… better?

How do you measure success?

  • Convention adherence rates?
  • Time saved in code review?
  • Reduction in refactoring PRs?

We’re running a pilot next quarter to test AI-powered custom linting (sending diffs + style rules to an LLM for semantic review). But I’m not convinced we have the right mental model yet.

What am I missing? What’s working in your organizations?


Luis Rodriguez
Director of Engineering, Financial Services
Formerly Intel, Adobe

Oh wow, Luis—this hits SO close to home. :bullseye:

We’re dealing with the exact same issue in design systems land. Except instead of architectural middleware, it’s design tokens and accessibility standards getting bypassed.

The UI Component Nightmare

Last month, an engineer (fantastic dev, not their fault) used an AI assistant to generate a new dashboard component. The code looked beautiful. Rendered perfectly in the demo. Shipped to staging.

Then our accessibility audit flagged it: missing ARIA labels, broken keyboard navigation, insufficient color contrast. All things our design system components handle automatically. The AI just… didn’t know they existed.

The developer said, “The AI gave me working code in 5 minutes. I didn’t think to check if we already had these patterns.”

That’s the trap. AI optimizes for “does it work?” not “does it fit?”

What’s Actually Working: Two-Layer Enforcement

We’re experimenting with something similar to what you’re describing:

Layer 1: Fast deterministic lints (ESLint, Prettier, Stylelint)

  • Runs in seconds
  • Catches syntax, formatting, basic pattern violations
  • Zero debate—fix it or CI fails

Layer 2: Semantic LLM review (custom script we built)

  • Sends PR diff + our style guide to an LLM
  • Reviews for: naming quality, documentation completeness, design system adherence, a11y patterns
  • Posts structured review comments with suggestions
  • Takes ~30 seconds but catches things traditional linters miss
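
For the curious, the core of our Layer 2 script is really just prompt assembly. Here’s a minimal sketch—the function name and prompt wording are illustrative, not our production code, and the actual LLM API call is elided since it’s provider-specific:

```javascript
// Assemble the semantic-review prompt from a PR diff and a style guide.
// Illustrative only: the real script pipes this prompt to an LLM API and
// posts the structured response back as PR review comments.
function buildReviewPrompt(diff, styleGuide) {
  return [
    "You are reviewing a pull request for convention adherence.",
    "Check: naming quality, documentation completeness,",
    "design system adherence, and a11y patterns.",
    "",
    "## Style guide",
    styleGuide,
    "",
    "## PR diff",
    diff,
    "",
    "Respond with one structured comment per violation,",
    "each citing the rule it violates.",
  ].join("\n");
}

// Toy example: a diff that breaks a camelCase naming rule.
const prompt = buildReviewPrompt(
  "+ const get_user_balance = () => fetchBalance();",
  "Use camelCase for all function and variable names."
);
```

The ~30 seconds is almost entirely the LLM round trip; the assembly itself is trivial, which is why this layer is cheap to customize per repo.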

The key insight from this DEV article really shaped our thinking: “Each rule you codify reduces review overhead, eliminates a class of regressions, and turns drift into an auto-fixed diff.”

Every lint-green merge makes the repo more teachable—not just for humans, but for the AI agents learning from it.

The Balance Question You Asked

Here’s where I’m torn: How strict do we get without killing AI’s velocity benefits?

If we make the linting too aggressive, developers will:

  1. Disable it locally (already seeing this)
  2. Work around it with // eslint-disable comments
  3. Lose the 21% speed boost AI coding gives us

But if we’re too permissive, we’re just moving code review time to refactoring time six months later. That’s not a win.

My current theory: Start with warnings, not blockers. Track which warnings get ignored most often. Those reveal either:

  • Bad rules that need revision, OR
  • Training gaps where the team doesn’t understand the “why”

Then gradually promote the most-ignored warnings to blockers once the team internalizes the pattern.

Questions Back to You

  1. When your AI bypassed the middleware, did code review catch it? Or did it surface later in production/audit?
  2. Are you thinking about AI-specific linting (rules that specifically check for AI antipatterns), or just enforcing existing standards harder?
  3. Have you looked at tools like CodeRabbit that combine AST analysis with LLMs?

I keep coming back to this: We’re not just teaching AI to write code. We’re teaching it to write our code. That cultural transmission is everything.

—Maya

Luis and Maya—you’re both identifying the symptoms. Let me share the strategic risk this creates at organizational scale.

The Productivity Paradox Nobody’s Talking About

Here’s the uncomfortable truth: 93% of organizations have adopted AI coding tools, but they’re only seeing 10% productivity gains.

Why? Because we’re optimizing the wrong part of the software development lifecycle.

AI makes developers 30% faster at writing code. Great. But it makes them only 8% faster at delivery because testing, reviews, dependencies, and rework dominate cycle time.

When that rework includes “undo the architectural drift AI introduced,” you’re not 8% faster. You’re slower.

What We’re Doing: Machine-Enforced Guarantees

At our mid-stage SaaS company (120 engineers, $50M ARR), we had the same problem last year. Our solution:

CI pipeline blocks on custom lint rules before any PR can merge.

Not recommendations. Not warnings. Blockers.

Our rules encode:

  • API design patterns (RESTful conventions, error response structures)
  • Database access patterns (must use query builder, no raw SQL outside specific paths)
  • Security requirements (authentication checks, input validation)
  • Observability standards (logging, metrics, tracing instrumentation)
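
To make one of those concrete: an “error response structure” rule can be enforced in CI with a small schema check run over API response fixtures. This is a hedged sketch—the envelope shape (`error.code`, `error.message`) is an assumed example, not our actual schema:

```javascript
// Validate that an API error payload follows a single envelope shape.
// The field names here are illustrative assumptions for the sketch.
function isValidErrorEnvelope(payload) {
  return (
    typeof payload === "object" &&
    payload !== null &&
    typeof payload.error === "object" &&
    payload.error !== null &&
    typeof payload.error.code === "string" &&
    typeof payload.error.message === "string"
  );
}

console.log(isValidErrorEnvelope({ error: { code: "AUTH_401", message: "Token expired" } })); // true
console.log(isValidErrorEnvelope({ message: "oops" })); // false
```

In CI, a check like this runs against every recorded error fixture and fails the build on a mismatch—which is exactly what lets AI-generated handlers self-correct before a human ever looks at them.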

The insight from Factory AI was transformative: “Linters turn human intent into machine-enforced guarantees that allow agents to plan, generate, and self-correct without waiting on humans.”

This isn’t about AI vs. humans. It’s about codifying the “why” so AI can self-correct before consuming human review time.

The Implementation Reality

Here’s what happened when we rolled this out:

Month 1: Developer rebellion. “This slows us down.” “AI is smarter than these rules.” “We’re wasting time on lint fixes.”

Month 2: Grudging acceptance. Code review time dropped 35%. Fewer “why didn’t you use our existing pattern?” comments.

Month 3: Believers. Engineers started adding lint rules for patterns they wanted enforced. AI-generated code improved because the agent learned from cleaner examples.

Month 6: Now it’s cultural. New engineers onboarding ask, “What do the lint rules say?” before asking senior engineers. The rules became our living documentation.

The Measurement Framework

Luis asked how to measure success. Here’s what we track:

  1. Convention adherence rate: % of merged PRs with zero style/pattern violations
  2. Time to first approval: How long before a PR gets its first approving review
  3. Refactoring PR frequency: How often we open PRs specifically to “fix AI-generated code”
  4. AI code retention rate: What % of AI-generated code survives to production vs. gets rewritten

Our current numbers:

  • Adherence: 94% (up from 67% pre-enforcement)
  • Time to approval: 4.2 hours avg (down from 11 hours)
  • Refactoring PRs: -60%
  • AI retention: 89% (was 71%)

The ROI is clear. We invested 2 engineer-months building the custom linting framework. We’re saving ~40 engineer-hours per week in code review and rework.

The Strategic Question

Maya’s right that this is about teaching AI to write your code, not generic code. But Luis, you asked the deeper question: What am I missing?

Here’s what I think you’re missing: You need to codify the “why” before you scale AI adoption.

Most organizations are doing this backward:

  1. Adopt AI tools (93% of us)
  2. Experience drift and quality issues
  3. Try to add guardrails retroactively

The right sequence:

  1. Audit your current conventions (what are the patterns you actually follow?)
  2. Encode them as enforceable rules (linters, CI checks, pre-commit hooks)
  3. Document the “why” in human-readable context files (AGENTS.md, PATTERNS.md)
  4. Then scale AI adoption with those guardrails in place

We’re now piloting agentic AI coding systems (autonomous agents that plan, test, refine) because we trust the enforcement layer. Without it? I’d block that rollout entirely.

Question for This Thread

How many of you are experiencing the productivity paradox? High AI adoption, modest gains, unclear why?

I suspect the answer is architectural drift. You’re writing code faster but fixing it longer.

—Michelle Washington
CTO, Mid-stage SaaS
Formerly Twilio, Microsoft

Michelle’s metrics are :fire:—that ROI is undeniable. But I want to add the people dimension that often gets missed in these technical enforcement discussions.

The Developer Experience Problem

We rolled out strict AI convention enforcement at our EdTech startup six months ago. Michelle’s timeline tracks almost exactly with ours: rebellion → acceptance → advocacy.

But here’s what we learned: The Month 1 rebellion isn’t about the rules. It’s about trust.

When developers say “this slows us down,” what they’re really saying is:

  • “I don’t understand why these rules exist”
  • “I don’t trust that the rules are right”
  • “I feel like I’m being micromanaged by a bot”

We made the enforcement work by investing in the context engineering upfront, not just the enforcement layer.

What Context Engineering Looks Like in Practice

Our approach: AGENTS.md explains the “why” with examples, then we encode those patterns as rules.

Here’s a real example from our codebase:

AGENTS.md:

## Database Access Pattern

WHY: We use TypeORM query builder (not raw SQL) for all data access outside the analytics service.

REASON: Security (parameterized queries prevent SQL injection), testability (we can mock the query builder), and type safety (compile-time checks on schema changes).

EXAMPLES:
✅ GOOD: userRepo.findOne({ where: { email } })
❌ BAD: db.query('SELECT * FROM users WHERE email = ?', [email])

EXCEPTION: The analytics service uses raw SQL for complex reporting queries that TypeORM can't express efficiently. See analytics/README.md.

Then the lint rule enforces it. But developers understand the “why” before they hit the blocker.

The Danger of Outdated Context

Luis mentioned architectural drift. We experienced context drift that was just as dangerous.

Last month, an autonomous AI agent generated authentication code that referenced our old user.role field. That field had been migrated to a permissions system three months ago.

Why? Because our AGENTS.md hadn’t been updated after the refactor.

The agent wired the auth checks through a deprecated path. Tests passed (legacy path still worked). Code review missed it (looked reasonable). Deployed to staging. Security audit flagged it.

This taught us: Context files are living documentation that must be maintained with the same rigor as code.

Now we have:

  • A CI check that flags AGENTS.md if it hasn’t been updated in 90 days
  • Required AGENTS.md updates in our “Definition of Done” for refactoring PRs
  • Quarterly context audits where senior engineers verify accuracy

The Change Management Reality

Maya asked: How strict do we get without killing AI’s velocity benefits?

Here’s how we’re balancing it:

Gradual rollout with clear team training:

  1. Week 1: Introduce the rules as “warnings only” in CI
  2. Week 2: Team meeting where we walk through the “why” for each rule
  3. Week 3: Office hours where engineers can challenge rules they think are wrong
  4. Week 4: Promote to blocking errors, but with an escape hatch (senior eng approval can override)
  5. Month 2: Remove the escape hatch for rules with 95%+ adherence

This gives the team agency. They’re not fighting the rules—they’re collaborating on them.

The Cultural Shift Question

Michelle’s right that this becomes cultural when engineers start adding rules for patterns they want enforced.

But here’s the uncomfortable question: How do we bring teams along vs. impose top-down restrictions?

In my experience, top-down enforcement without buy-in creates:

  • Workarounds (// eslint-disable everywhere)
  • Resentment (“leadership doesn’t trust us”)
  • Attrition (high performers leave for environments with more autonomy)

The solution isn’t less enforcement—it’s more participation.

Our best rules came from junior engineers who said, “I keep seeing AI make this mistake. Can we block it?” When the team owns the rules, they enforce themselves.

What’s Working: The Hybrid Approach

Our current system:

  • Hard blockers: Security, accessibility, legal compliance (non-negotiable)
  • Soft warnings: Style, naming, documentation (tracked but not blocking)
  • Team-proposed rules: Engineers submit proposals, we vote quarterly on promoting warnings to blockers

This gives us Michelle’s enforcement rigor with developer agency.

Metrics we track:

  • Developer satisfaction scores (quarterly survey)
  • Time to resolve lint failures (are the rules clear enough?)
  • Rule override requests (which rules are getting bypassed and why?)

Current results:

  • Dev satisfaction: 7.8/10 (up from 6.1 pre-enforcement)
  • Time to resolve: 3.2 min avg
  • Override requests: Down 85% since we clarified the “why”

Back to Luis’s Original Question

You asked: What’s working in your organizations?

My answer: Enforcement works when context comes first.

AI agents don’t rebel against rules—they follow them literally. It’s the humans who need to understand the “why” before they’ll enforce the rules on themselves and the AI they’re working with.

—Keisha Johnson
VP Engineering, High-Growth EdTech
Formerly Google, Slack

Coming at this from the product side—all three of you are describing technical solutions to what I see as a velocity vs. quality trade-off that directly impacts customers.

The Product Perspective: Speed That Creates Debt

Luis’s original story resonates because it’s exactly what we’re seeing on the product delivery side:

AI lets us ship features faster, but we’re creating support debt from inconsistent UX patterns.

Last quarter:

  • AI helped us ship a new dashboard feature in 2 weeks (would’ve been 6 weeks manually)
  • Support tickets for that feature: 3x higher than our baseline
  • Why? The UI components didn’t match our existing design system, so users couldn’t find controls where they expected them

We saved 4 weeks in engineering time. We spent 6 weeks in support firefighting.

That’s not a win. That’s technical debt disguised as velocity.

The ROI Question Nobody’s Asking

Michelle shared incredible metrics: 2 engineer-months building the framework, saving 40 engineer-hours/week.

Let me translate that to product economics:

Investment:

  • 2 engineers × 4 weeks = 8 engineer-weeks
  • At $150k loaded cost = ~$25k investment

Return:

  • 40 hours/week saved = 1 FTE equivalent
  • Value of 1 FTE: $150k/year → $12.5k/month in savings → ~2-month payback on the ~$25k investment
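
That arithmetic, written out so you can plug in your own numbers (the $150k loaded cost is my assumption, per above):

```javascript
// Payback-period sanity check for the enforcement-layer investment.
const LOADED_ANNUAL_COST = 150000; // per engineer, fully loaded (assumption)

// Investment: 2 engineers for 4 weeks = 8 engineer-weeks.
const investment = (LOADED_ANNUAL_COST / 52) * 8; // ≈ $23k; round to ~$25k

// Return: 40 hours/week saved ≈ 1 FTE.
const monthlySavings = LOADED_ANNUAL_COST / 12; // $12.5k/month

const paybackMonths = investment / monthlySavings;
console.log(paybackMonths.toFixed(1)); // ≈ 1.8 — call it a 2-month payback
```

Sensitivity matters more than precision here: even if the hours-saved estimate is off by half, payback is still under four months.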

That’s a no-brainer ROI from a business perspective. But here’s the part most engineering teams don’t surface to product leadership:

What’s the cost of NOT doing this?

If architectural drift leads to:

  • Slower feature delivery (because rework takes longer than initial build)
  • Higher defect rates (because inconsistent code is harder to test)
  • Support escalations (because UX inconsistency confuses users)

Then the opportunity cost is what you’re not shipping because you’re fixing AI-generated technical debt.

What I’m Advocating For: Pilot Programs

Luis, you mentioned running a pilot next quarter. Here’s how I’d structure that from a product perspective:

Pick one team. Measure before/after.

Before metrics (baseline for 4 weeks):

  • Time from feature request → production
  • Code review cycles per PR
  • Post-release defect rate
  • Support tickets per feature

Intervention: Add the enforcement layer

  • Custom linting (semantic + style)
  • AGENTS.md context engineering
  • CI blockers for violations

After metrics (4 weeks post-rollout):

  • Same metrics as baseline
  • Plus: Convention adherence rate, time spent on lint fixes

The hypothesis: Enforcement increases code quality faster than it slows velocity.

If you’re right, you’ll see:

  • Same or better time to production (because less rework)
  • Fewer review cycles (because AI self-corrects)
  • Lower defect rates (because patterns are enforced)
  • Fewer support tickets (because UX/behavior is consistent)

If you’re wrong, you’ll see velocity drop without commensurate quality gains. Then you know to adjust the approach.

Tools Worth Considering

A few folks asked about tooling. From my research with our eng team:

Drift: Scans your codebase, learns patterns, gives AI agents deep understanding of conventions. We’re piloting this.

CodeRabbit: AI-native linter that combines AST analysis with LLMs for semantic review. Does what Maya described (two-layer enforcement) out of the box.

Qodo AI: Focuses on compliance and security standards enforcement—critical for Luis’s financial services use case.

Rulens: Converts your existing lint rules into AI-friendly guidelines. Interesting for teams with established linting but no AI-specific context.

The key insight from OnSpace AI: AI-generated code consistency isn’t just about linting—it’s about maintaining team style guides as structured, machine-readable context.

The Strategic Product Question

Keisha asked: How do we bring teams along vs. impose top-down restrictions?

From a product lens, I’d flip this: How do we make consistency a competitive advantage?

If your AI-assisted teams can ship features:

  • Faster (because less rework)
  • With higher quality (because patterns are enforced)
  • More consistently (because UX conventions are automatic)

Then you’re not imposing restrictions—you’re removing friction. The team’s output becomes more predictable, which means product can commit to roadmaps with confidence.

That’s the business case for this investment.

What I’d Love to See

Can we get more data on the pilot programs you’re all running?

Specifically:

  • What metrics moved (and which didn’t)?
  • What was the team adoption curve (how long to overcome resistance)?
  • What’s the maintenance burden for the enforcement layer (how much ongoing work)?

I want to bring this to our leadership team, but I need the business case to be airtight.

—David Chen
Product Manager, SaaS Platform
Formerly Atlassian, Notion