We're Rolling Out Agentic Workflows to 50 Engineers - Here's Our Playbook (and What's Already Broken)

We’re a 120-person EdTech startup, and we just started rolling out agentic AI workflows to our 50-person engineering team. I want to share what’s working, what’s broken, and what’s surprised us - because this is messier and more interesting than any case study would suggest.

Context: Why We’re Doing This

The data on AI productivity gains is compelling. Research from Waydev suggests 30-50% productivity increases are possible. Our board is asking why we’re not seeing similar gains.

Fair question. So we decided to go beyond “engineers use Copilot” to actually deploying autonomous agent workflows.

Our Phased Approach

Phase 1: Test Generation and Code Review (Low Risk) - CURRENT

  • Agents automatically generate test cases for new code
  • Agents perform first-pass code review, flag potential issues
  • Humans review agent feedback and make final decisions
  • Status: 2 months in

Phase 2: API Implementation from Specs (Medium Risk) - STARTING NEXT MONTH

  • Product writes API specs
  • Agent generates full implementation (routes, validation, database queries, tests)
  • Senior engineer reviews and approves
  • Status: Pilot with 2 teams

Phase 3: Architectural Refactoring (High Risk) - NOT STARTED

  • Agent analyzes codebase and proposes refactoring opportunities
  • Humans evaluate and decide whether to pursue
  • Status: On hold pending Phase 2 results

What’s Working: Test Generation Success

This has been our clear win.

Before agents:

  • Test coverage averaged 65%
  • Engineers wrote tests grudgingly
  • Edge cases often missed
  • Test quality inconsistent

With agents:

  • Test coverage up to 82% (+17 percentage points)
  • Engineers actually like that agents handle tedious test writing
  • Agents are REALLY good at thinking of edge cases
  • Consistent test patterns across team

Engineers are using agents for:

  • Unit test generation
  • Integration test scaffolding
  • Edge case identification
  • Test data generation

Nobody misses writing boilerplate tests. This feels like a genuine productivity win without significant downsides.

What’s Broken: Pattern Mismatch Problem

This is our biggest pain point right now.

The Problem: Agents generate technically correct code that doesn’t match our internal patterns and conventions.

Example:
Agent generates an API endpoint that:

  • ✓ Handles all specified requirements
  • ✓ Has good error handling
  • ✓ Includes tests
  • ✗ Uses a different authentication middleware pattern than our existing 47 endpoints
  • ✗ Structures error responses differently than our API conventions
  • ✗ Names things inconsistently with our codebase

Result: Engineer spends 30 minutes cleaning up pattern mismatches.

We’re getting the speed benefit of AI generation, but losing it to pattern alignment work.

Unexpected Issue #1: Senior Engineer Threat Response

I didn’t anticipate this, but about 30% of our senior engineers are… anxious.

What they’re saying (in private conversations):

  • “If agents can do what I do, why am I valuable?”
  • “I spent 10 years learning to architect systems, now AI can do it in 10 minutes?”
  • “Am I going to be managed out for being too expensive?”

What they’re doing:

  • Some are embracing it (reframing themselves as “agent orchestrators”)
  • Some are quietly resistant (insisting on doing things manually)
  • Some are job searching (we’ve lost 2 seniors in the past quarter; both mentioned AI concerns in exit interviews)

This is a cultural challenge I wasn’t prepared for.

Unexpected Issue #2: Junior Engineer Over-Dependence

The flip side: Our junior engineers are TOO comfortable with agents.

One junior literally said: “Why would I learn to write SQL queries when the agent does it better than I ever will?”

Valid question, actually. But also terrifying.

We’re seeing:

  • Juniors who can describe features but can’t implement them without AI
  • Difficulty debugging when AI-generated code fails
  • Lack of foundational knowledge about why solutions work

This connects to what Alex was discussing in the other thread about junior development.

The Productivity Measurement Problem

Here’s the uncomfortable question: How do we measure productivity when agents do the work?

Traditional metrics:

  • Story points completed ✓ (up 25%)
  • PRs merged ✓ (up 30%)
  • Features shipped ✓ (up 20%)

But:

  • Bug rate in production ↑ (up 15%)
  • Time to debug issues ↑ (up 20%)
  • Team understanding of codebase ↓ (subjective but noticeable)

So are we more productive, or just moving faster toward technical debt?

Governance We Implemented

Based on early issues, we added requirements:

Agent-Generated Code Must Include:

  1. “AI Usage” section in PR - What did agent generate vs human-written?
  2. Pattern Compliance Check - Does this match our conventions?
  3. Human Review Requirement - +1 from senior engineer before merge
  4. Explanation Requirement - Author must explain how the code works

Agents Cannot:

  • Deploy to production without human trigger
  • Modify database schemas
  • Change authentication or authorization logic
  • Make architectural decisions without human approval
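The two lists above can be made machine-checkable at merge time. Here is a minimal sketch, assuming agent-generated PRs are represented as plain dicts; the section names, field names, and protected path prefixes are all hypothetical illustrations, not a real CI system’s schema:

```python
# Hedged sketch of a merge gate for agent-generated PRs. Section names,
# dict fields, and protected paths below are illustrative assumptions.

REQUIRED_SECTIONS = ["AI Usage", "Pattern Compliance", "How It Works"]

def pr_merge_blockers(pr: dict) -> list[str]:
    """Return every reason an agent-generated PR cannot merge (empty = ok)."""
    blockers = []
    body = pr.get("description", "")
    for section in REQUIRED_SECTIONS:
        if f"## {section}" not in body:
            blockers.append(f"missing '{section}' section in PR description")
    if not pr.get("senior_approval", False):
        blockers.append("no +1 from a senior engineer")
    # Paths agents may never touch without explicit human ownership.
    protected = ("migrations/", "auth/", "deploy/")
    for path in pr.get("changed_files", []):
        if path.startswith(protected):
            blockers.append(f"agent-generated change to protected path: {path}")
    return blockers
```

A check like this runs as one more CI job: if the list is non-empty, the bot comments the blockers and refuses to merge, which keeps the governance list from quietly becoming optional.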

The Culture Challenge

The hardest part isn’t technical - it’s cultural.

We’re seeing three camps emerge:

1. “AI Enthusiasts” (~40%)

  • Love the productivity gains
  • Excited about working at higher abstraction level
  • See agents as tools that amplify their capabilities

2. “Cautious Pragmatists” (~40%)

  • Use agents for some tasks, not others
  • Worried about dependency but see value
  • Want more guardrails and best practices

3. “Resistant Skeptics” (~20%)

  • Prefer to work manually
  • Don’t trust agent-generated code
  • Concerned about long-term implications

How do you manage a team where people have fundamentally different relationships with AI?

What I’m Learning

1. Velocity ≠ Value
Shipping faster doesn’t mean shipping better. We’re learning to distinguish between “more features” and “better product.”

2. Autonomy Requires Trust, Trust Requires Understanding
Teams only trust agent autonomy when they understand what agents can and can’t do well. That requires experimentation and transparency.

3. Change Management Is Harder Than Technology
The technical implementation is straightforward. Getting 50 engineers with different skill levels and attitudes to adopt new workflows? Much harder.

4. Juniors and Seniors Need Different Approaches
What works for seniors (more autonomy, trust their judgment) doesn’t work for juniors (need more structure, oversight).

Questions I’m Wrestling With

1. How do we balance autonomy with oversight at scale?

  • Per-engineer customization? Teams decide their own rules? Company-wide standards?

2. How do we measure what actually matters?

  • Not just velocity, but sustainable velocity
  • Not just features, but maintainable features

3. How do we manage the senior engineer anxiety?

  • Is this just an adjustment period or a legitimate concern?
  • How do we help them see value in their evolving role?

4. What’s the right governance model?

  • Too restrictive = lose productivity gains
  • Too permissive = accumulate problems we’ll regret

What I’d Love to Hear

If you’re rolling out similar workflows:

  • What governance have you found essential?
  • How are you handling the culture change?
  • Have you figured out how to measure productivity meaningfully?
  • How are you maintaining code quality while increasing velocity?

Because honestly, we’re making this up as we go. And I’d love to learn from others navigating the same transition.

This is the future of engineering work, for better or worse. We might as well figure it out together.

Keisha, thank you for sharing this so transparently. Your experience mirrors ours in many ways, and your willingness to share what’s NOT working is valuable.

On Measuring Productivity

Your question about velocity vs value is THE question. We struggled with the same thing.

Here’s what we changed: We stopped measuring individual productivity and started measuring team outcomes.

Old metrics (misleading):

  • PRs merged per engineer
  • Story points per sprint
  • Lines of code

New metrics (more meaningful):

  • DORA metrics (deployment frequency, lead time, MTTR, change failure rate)
  • Customer-impacting bugs per release
  • Team’s ability to take on complex work (subjective but important)
  • Feature retention (do users actually use what we ship?)

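Two of those DORA metrics are simple enough to compute from deploy records alone. A minimal sketch, where the record shape is an assumption for illustration rather than any real tool’s schema:

```python
# Sketch: deployment frequency and change failure rate from a window of
# deploy records. The {"failed": bool} record shape is an assumption.

def dora_summary(deploys: list[dict], days: int) -> dict:
    """Summarize deploys observed over a `days`-long window."""
    total = len(deploys)
    failures = sum(1 for d in deploys if d["failed"])
    return {
        "deploys_per_week": round(total / (days / 7), 2),
        "change_failure_rate": round(failures / total, 2) if total else 0.0,
    }
```

Lead time and MTTR need timestamps from your ticketing and incident systems, but they reduce to the same pattern: team-level aggregates, no per-engineer attribution.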
The key insight: AI makes individual engineers faster, but that doesn’t always make the TEAM more effective.

If engineers ship features quickly but those features are buggy, hard to maintain, or not what users needed - that’s not productivity, that’s churn.

On The Senior Engineer Anxiety

This is real and it’s not going away. Here’s what we’re doing:

Reframe the role explicitly:

  • Seniors aren’t “coders who code fast” - they’re architects and decision-makers
  • AI doesn’t replace senior judgment, it amplifies senior leverage
  • The skill that matters is knowing WHAT to build and HOW to design it, not typing speed

Concrete actions:

  • Promoted two seniors to “Staff Engineer - AI Engineering” roles
  • They focus on: agent orchestration, governance frameworks, training others
  • Made it clear: This is a senior-level skill set, not entry-level work

Results:
Most seniors are coming around. Those who insisted “my value is in writing code” - some left, and honestly, that might be okay. The role IS changing.

On Governance That Scales

Your governance model is good, but here’s what we added that helped:

Tiered Approval Based on Blast Radius:

Low Risk (single PR reviewer):

  • Test generation
  • Documentation updates
  • UI styling changes
  • Non-critical bug fixes

Medium Risk (senior engineer approval):

  • New features in existing modules
  • API changes with backward compatibility
  • Database queries (not schema changes)

High Risk (architecture review + approval):

  • New services or major components
  • Breaking changes
  • Security or auth modifications
  • Architectural refactoring

This prevents low-risk work from getting bogged down in process while maintaining oversight on critical decisions.
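If it helps, the tier routing can be automated from the diff itself. A sketch, assuming changed file paths map onto risk by directory; the prefixes here are hypothetical examples of how a team might draw those lines:

```python
# Hedged sketch of blast-radius routing: pick the strictest review tier
# that any changed file triggers. Path prefixes are illustrative.

HIGH_RISK = ("auth/", "security/", "migrations/", "infra/")
MEDIUM_RISK = ("api/", "db/queries/")

def approval_tier(changed_files: list[str]) -> str:
    """Return the review tier for a changeset, strictest file wins."""
    if any(f.startswith(HIGH_RISK) for f in changed_files):
        return "high: architecture review + approval"
    if any(f.startswith(MEDIUM_RISK) for f in changed_files):
        return "medium: senior engineer approval"
    return "low: single PR reviewer"
```

A bot that labels each PR with its tier and assigns the matching reviewers makes the policy self-enforcing instead of something reviewers must remember.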

Offer: Share Our Framework

We’ve documented our “AI-Assisted Development Framework” - governance, best practices, metrics, cultural guidelines.

Would be happy to share it with you offline. Maybe we can collaborate and create something the broader community can use?

Because you’re right: We’re all figuring this out together. No point in everyone reinventing the same wheels.

Your senior engineer anxiety issue hits close to home. I’ve lost one senior to this exact concern, and had difficult conversations with others.

The Psychology of Obsolescence

Here’s what I’m seeing: Seniors built their identity around skills that AI is now commoditizing.

  • “I’m the person who can write complex algorithms” → AI can generate algorithms
  • “I’m the person who debugs hard problems” → AI can suggest debugging approaches
  • “I’m the person who knows the codebase deeply” → AI can analyze codebases

Their expertise feels threatened because it IS being partially automated.

The Reframe That’s Working

What I tell my seniors: Your value isn’t in the tasks you perform, it’s in the judgment you bring.

AI can’t (yet):

  • Understand business context and priorities
  • Make architectural trade-offs based on organizational constraints
  • Navigate political and interpersonal dynamics
  • Mentor and develop other engineers
  • Question whether we’re building the right thing
  • Take accountability for decisions

This is the work that matters. The implementation is increasingly commoditized, but the decision-making is not.

Practical Approach: “Orchestration” Role

We created explicit “Senior Engineer - Agent Orchestration” expectations:

Your job is to:

  1. Define what agents should build (requirements, constraints, patterns)
  2. Review agent-generated solutions for quality and fit
  3. Teach agents (and juniors) the patterns that matter
  4. Make architectural decisions agents can’t make
  5. Own the outcomes even when agents did the implementation

This is senior-level work. It requires experience, judgment, and deep technical knowledge. But it’s focused on decision-making and oversight, not typing.

The Juniors Question

Your observation about junior over-dependence is spot on and connects to my concerns from the other thread.

Here’s our approach:

“No AI Fridays” - One day per week, juniors work without AI assistance:

  • Forces them to build foundational knowledge
  • Develops debugging skills AI can’t teach
  • Builds confidence in manual capabilities

Structured Learning Paths:

  • Month 1-2: Core fundamentals with minimal AI
  • Month 3-4: AI-assisted work with mandatory explanations
  • Month 5+: Full AI usage with understanding requirements

“Explain Before Merge” Policy:
If you can’t explain how the AI-generated code works and why it’s the right approach, it doesn’t merge. Period.

Cultural Framing

You asked how to manage three camps with different attitudes. Here’s what’s working for us:

Frame it as “AI-Native Engineering” not “Using AI”

AI-Native Engineering means:

  • Deep system understanding (foundation)
  • Effective use of AI tools (leverage)
  • Ability to critique AI outputs (judgment)
  • Ownership of outcomes regardless of who wrote code (accountability)

This framing works because:

  • Enthusiasts see it as embracing modern practice
  • Pragmatists see it as a balanced approach
  • Skeptics see that it values their expertise

We’re not saying “use AI or you’re behind” - we’re saying “be excellent engineers in an AI-enabled world.”

On Your Productivity Metrics

The fact that bugs and debug time are UP is a warning sign. Velocity without quality is just thrash.

Suggestion: Add these metrics:

  • Code Churn - How often do we rewrite recently-written code?
  • “WTF/minute” in code review - How confusing is the code to reviewers?
  • Time to understand - How long does it take a new engineer to contribute to a module?
  • Architectural coherence - Subjective but important: Does the system still make sense?

These are leading indicators of whether your velocity is sustainable or accumulating debt.
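Of these, code churn is the easiest to quantify. One way to sketch it, counting the share of changed lines that land in files modified again within a short window (in practice you would feed this from `git log --numstat`; the commit-record shape here is a simplifying assumption):

```python
# Sketch of a churn metric: fraction of changed lines that rewrite a file
# touched within the last `window_days`. Record shape is an assumption.
from datetime import date, timedelta

def churn_rate(commits: list[dict], window_days: int = 14) -> float:
    """commits: [{"date": date, "file": str, "lines": int}, ...]."""
    last_touched: dict[str, date] = {}
    churned = total = 0
    for c in sorted(commits, key=lambda c: c["date"]):
        total += c["lines"]
        prev = last_touched.get(c["file"])
        if prev and (c["date"] - prev) <= timedelta(days=window_days):
            churned += c["lines"]  # rewriting recently-written code
        last_touched[c["file"]] = c["date"]
    return round(churned / total, 2) if total else 0.0
```

Tracked per team over time, a rising churn rate is a cheap early warning that velocity is being spent rewriting last sprint’s output.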

One More Thought: Celebrate When Humans Catch Agent Mistakes

Make it culturally valued when engineers identify problems in AI-generated code.

We have a #agents-learned channel in Slack where engineers post:

  • Times they caught AI mistakes
  • Patterns AI gets wrong
  • Improvements needed

This does two things:

  1. Makes it clear that human judgment is valued
  2. Helps everyone learn what AI is good/bad at

The message: You’re not competing with AI, you’re teaching it and compensating for its limitations.

That’s a very different framing than “AI is replacing you.”

Quick security perspective on your rollout:

Your Current Phase 1 is Low-Risk (Agree)

Test generation and code review assistance? These are good starting points because:

  • Low blast radius (bad tests don’t cause outages, they just miss bugs)
  • Human review catches issues
  • Good learning environment for team and agents

Your Phase 2 Needs Security Gates

“Agent generates full API implementation” raises concerns:

Before you roll this out, ensure:

✓ Security review is mandatory - Not just code review, actual security review
✓ Agents can’t modify auth/authz logic - Keep this human-only
✓ Input validation is human-verified - AI-generated validation often has gaps
✓ SQL injection testing - If agents generate database queries, audit for injection risks
✓ Rate limiting and abuse prevention - Agents might miss these non-functional requirements

Specific recommendation: Agent-generated APIs should go through your security team before first production deployment. Treat them like any external code: trust nothing, verify everything.
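To make the SQL-injection audit point concrete, here is the contrast reviewers should look for. Both functions “work” on happy-path input, but only the parameterized form is safe; agent output in the first shape should be rejected on sight:

```python
# Illustrative contrast for reviewing agent-generated queries.
import sqlite3

def find_user_unsafe(conn, name):
    # String interpolation: name = "x' OR '1'='1" returns every row.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(conn, name):
    # Placeholder binding: the driver treats `name` as data, not SQL.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()
```

The review heuristic is mechanical: any query built by f-string, `+`, or `%` formatting from request data fails review, regardless of whether the tests pass.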

Pattern Mismatch = Security Risk

Your “agents don’t match our patterns” problem isn’t just aesthetic - it’s a security concern.

Why: Your existing patterns likely encode security lessons learned. When agents apply different patterns, they might reintroduce vulnerabilities you’ve already fixed.

Example:

  • Your authentication middleware has been hardened over 2 years
  • The agent generates an endpoint using a different auth pattern from docs it found online
  • That pattern has known vulnerabilities
  • You just regressed your security posture

Suggestion: Create “Security Pattern Library” that agents must use. Not suggestions - requirements.

The Bug Rate Increase Is a Warning

You mentioned bugs up 15%. From security perspective: Are those bugs security-relevant?

Track:

  • How many are input validation failures?
  • How many expose data that should be private?
  • How many are authentication/authorization bypasses?
  • How many are injection vulnerabilities?

If AI-generated code is introducing security bugs at higher rate than human code, that’s a problem that requires process changes, not just bug fixes.

Governance Addition: Security Checklist

Add to your PR requirements:

For Agent-Generated Code:

  • Input validation verified by human
  • Authentication/authorization logic reviewed
  • Database queries checked for injection
  • Error messages don’t leak sensitive data
  • Rate limiting considered
  • Security patterns from library used

This is in addition to functional review.

Offer: Security Review Checklist

We’ve created an “AI-Generated Code Security Checklist” - things to watch for when reviewing agent work. Happy to share if useful.

The bottom line: Agent-generated code should go through STRICTER security review than human code, not looser, because agents don’t have security intuition yet.

Love the transparency here! Your “pattern mismatch” problem is exactly what I experienced with AI-generated design components.

The Pattern Problem Is Universal

In design: Agent generates components that technically work but don’t match our design system.

In code: Agent generates code that technically works but doesn’t match your codebase patterns.

Same root cause: Agents optimize for “working” not “consistent.”

Potential Solution: Fine-Tuning on Your Patterns

Have you considered:

  • Training agents on your actual codebase?
  • Providing examples of “this is how WE do authentication” before agent generates?
  • Creating a “pattern library” that agents must reference?

In design, I’m experimenting with:

  1. Feeding our entire design system to the AI as context
  2. Showing 3-4 examples of “our style” before asking for new components
  3. Having AI critique its own output against our patterns

Results: Better pattern matching, but still not perfect. And it adds overhead that reduces the speed gains.
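The example-feeding step generalizes directly to code. A minimal sketch of assembling a few-shot prompt from a pattern library before asking for generation; the prompt layout and record shape are assumptions to adapt to whatever model API you use:

```python
# Sketch: prepend house-pattern examples to a generation prompt.
# Prompt layout and example record shape are illustrative assumptions.

def build_prompt(task: str, examples: list[dict]) -> str:
    """examples: [{"label": str, "code": str}, ...] from the pattern library."""
    parts = ["You must follow the house patterns shown below exactly.\n"]
    for ex in examples:
        parts.append(f"### {ex['label']}\n```\n{ex['code']}\n```\n")
    parts.append(f"### Task\n{task}")
    return "\n".join(parts)
```

The cost is exactly the overhead described above: someone has to curate and maintain the example set, and every generation spends context tokens on it.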

Maybe that overhead is worth it? Or maybe we need to accept that AI-generated work always requires human “polish” to fit organizational style?

On The Three Camps

Your observation about teams splitting into enthusiasts/pragmatists/skeptics resonates.

In design: Same split. Designers who embrace Figma AI, designers who use it selectively, designers who refuse.

What I learned: You can’t force convergence. Different people will have different workflows, and that’s okay.

What you CAN do: Ensure quality standards are met regardless of workflow.

  • Use AI or don’t use AI - your choice
  • But the output must meet our standards - not negotiable

This respects individual working styles while maintaining team quality.

Suggestion: Design-Engineering Collaboration on Agents

Your agents generate full APIs. Do those APIs interact with frontend/design?

If so, consider:

  • Designer-agent pairing for user-facing features
  • Agents implement backend, but designer reviews UX implications
  • Cross-functional review before merge

Because agents might implement API correctly but miss user experience nuances that a designer would catch.

We’re experimenting with this - agents as implementation layer, but humans from multiple disciplines reviewing for their domain concerns.