We’re a 120-person EdTech startup, and we just started rolling out agentic AI workflows to our 50-person engineering team. I want to share what’s working, what’s broken, and what’s surprised us - because this is messier and more interesting than any case study would suggest.
Context: Why We’re Doing This
The data on AI productivity gains is compelling. Waydev research shows 30-50% productivity increases are possible. Our board is asking why we’re not seeing similar gains.
Fair question. So we decided to go beyond “engineers use Copilot” and actually deploy autonomous agent workflows.
Our Phased Approach
Phase 1: Test Generation and Code Review (Low Risk) - CURRENT
- Agents automatically generate test cases for new code
- Agents perform first-pass code review, flag potential issues
- Humans review agent feedback and make final decisions
- Status: 2 months in
Phase 2: API Implementation from Specs (Medium Risk) - STARTING NEXT MONTH
- Product writes API specs
- Agent generates full implementation (routes, validation, database queries, tests)
- Senior engineer reviews and approves
- Status: Pilot with 2 teams
Phase 3: Architectural Refactoring (High Risk) - NOT STARTED
- Agent analyzes codebase and proposes refactoring opportunities
- Humans evaluate and decide whether to pursue
- Status: On hold pending Phase 2 results
What’s Working: Test Generation Success
This has been our clear win.
Before agents:
- Test coverage averaged 65%
- Engineers wrote tests grudgingly
- Edge cases often missed
- Test quality inconsistent
With agents:
- Test coverage up to 82% (+17 percentage points)
- Engineers actually like that agents handle tedious test writing
- Agents are REALLY good at thinking of edge cases
- Consistent test patterns across team
Engineers are using agents for:
- Unit test generation
- Integration test scaffolding
- Edge case identification
- Test data generation
Nobody misses writing boilerplate tests. This feels like a genuine productivity win without significant downsides.
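To make the edge-case point concrete, here’s a sketch of the kind of test suite agents produce for us. This is illustrative, not our actual code: `normalize_score` is a hypothetical utility, but the pattern of enumerating boundaries (zero, max, over-max, negative, invalid input) is exactly what we see agents do well.

```python
# Hypothetical utility function; a stand-in for real application code.
def normalize_score(raw, max_points):
    """Convert a raw score to a 0-100 percentage, clamped to that range."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    pct = (raw / max_points) * 100
    return max(0.0, min(100.0, pct))


# Agent-style edge-case tests: boundaries a human author often skips.
def test_zero_score():
    assert normalize_score(0, 50) == 0.0

def test_full_score():
    assert normalize_score(50, 50) == 100.0

def test_over_max_is_clamped():
    # e.g. extra credit pushing raw past max_points
    assert normalize_score(60, 50) == 100.0

def test_negative_is_clamped():
    assert normalize_score(-5, 50) == 0.0

def test_invalid_max_raises():
    try:
        normalize_score(10, 0)
        assert False, "expected ValueError"
    except ValueError:
        pass
```

Humans still review these, but the enumeration itself is free.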
What’s Broken: Pattern Mismatch Problem
This is our biggest pain point right now.
The Problem: Agents generate technically correct code that doesn’t match our internal patterns and conventions.
Example:
Agent generates an API endpoint that:
- Handles all specified requirements
- Has good error handling
- Includes tests

But it also:
- Uses a different authentication middleware pattern than our existing 47 endpoints
- Structures error responses differently than our API conventions
- Names things inconsistently with our codebase
Result: Engineer spends 30 minutes cleaning up pattern mismatches.
We’re getting the speed benefit of AI generation, but losing it to pattern alignment work.
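One mitigation we’re experimenting with is encoding conventions as a CI lint pass, so pattern mismatches get flagged before a human spends time on them. Below is a minimal sketch under assumed conventions: the disallowed patterns and the `app.middleware.auth` module name are hypothetical stand-ins for whatever your codebase standardizes on.

```python
import re

# Hypothetical convention: all endpoints must use our in-house
# require_auth middleware, not framework-level auth decorators.
# The patterns and hints below are illustrative, not a real ruleset.
DISALLOWED = [
    (re.compile(r"from\s+flask_httpauth\s+import"),
     "use app.middleware.auth.require_auth instead"),
    (re.compile(r"@login_required\b"),
     "use @require_auth per our API conventions"),
]

def check_file(text):
    """Return (line_number, hint) pairs for every convention violation."""
    problems = []
    for i, line in enumerate(text.splitlines(), start=1):
        for pattern, hint in DISALLOWED:
            if pattern.search(line):
                problems.append((i, hint))
    return problems
```

A regex pass won’t catch structural mismatches (error-response shape, naming), but it turns the most common violations into instant, reviewable feedback instead of a 30-minute cleanup.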
Unexpected Issue #1: Senior Engineer Threat Response
I didn’t anticipate this, but about 30% of our senior engineers are… anxious.
What they’re saying (in private conversations):
- “If agents can do what I do, why am I valuable?”
- “I spent 10 years learning to architect systems, now AI can do it in 10 minutes?”
- “Am I going to be managed out for being too expensive?”
What they’re doing:
- Some are embracing it (reframing themselves as “agent orchestrators”)
- Some are quietly resistant (insisting on doing things manually)
- Some are job searching (we’ve lost 2 seniors in the past quarter, both mentioned AI concerns in exit interviews)
This is a cultural challenge I wasn’t prepared for.
Unexpected Issue #2: Junior Engineer Over-Dependence
The flip side: Our junior engineers are TOO comfortable with agents.
One junior literally said: “Why would I learn to write SQL queries when the agent does it better than I ever will?”
Valid question, actually. But also terrifying.
We’re seeing:
- Juniors who can describe features but can’t implement them without AI
- Difficulty debugging when AI-generated code fails
- Lack of foundational knowledge about why solutions work
This connects to what Alex was discussing in the other thread about junior development.
The Productivity Measurement Problem
Here’s the uncomfortable question: How do we measure productivity when agents do the work?
Traditional metrics:
- Story points completed ✓ (up 25%)
- PRs merged ✓ (up 30%)
- Features shipped ✓ (up 20%)
But:
- Bug rate in production ↑ (up 15%)
- Time to debug issues ↑ (up 20%)
- Team understanding of codebase ↓ (subjective but noticeable)
So are we more productive, or just moving faster toward technical debt?
Governance We Implemented
Based on early issues, we added requirements:
Agent-Generated Code Must Include:
- “AI Usage” section in PR - What did agent generate vs human-written?
- Pattern Compliance Check - Does this match our conventions?
- Human Review Requirement - +1 from senior engineer before merge
- Explanation Requirement - Author must explain how the code works
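The disclosure requirements are enforced mechanically where possible. Here’s a sketch of the pre-merge check we run against PR descriptions; the section headings are our internal convention (hypothetical names here), not any standard.

```python
# Hypothetical pre-merge gate: a PR description must contain the
# required disclosure sections before it can enter review.
# Section names are illustrative stand-ins for internal conventions.
REQUIRED_SECTIONS = [
    "## AI Usage",            # what the agent generated vs. human-written
    "## Pattern Compliance",  # author attests code matches conventions
]

def missing_sections(pr_body):
    """Return the list of required sections absent from the PR body."""
    return [s for s in REQUIRED_SECTIONS if s not in pr_body]

def gate(pr_body):
    """Return (ok, message) suitable for a CI status check."""
    missing = missing_sections(pr_body)
    if missing:
        return False, "PR missing sections: " + ", ".join(missing)
    return True, "disclosure sections present"
```

The human-review and explanation requirements still rely on people, but this keeps “forgot the AI Usage section” out of the senior reviewer’s queue entirely.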
Agents Cannot:
- Deploy to production without human trigger
- Modify database schemas
- Change authentication or authorization logic
- Make architectural decisions without human approval
The Culture Challenge
The hardest part isn’t technical - it’s cultural.
We’re seeing three camps emerge:
1. “AI Enthusiasts” (~40%)
- Love the productivity gains
- Excited about working at higher abstraction level
- See agents as tools that amplify their capabilities
2. “Cautious Pragmatists” (~40%)
- Use agents for some tasks, not others
- Worried about dependency but see value
- Want more guardrails and best practices
3. “Resistant Skeptics” (~20%)
- Prefer to work manually
- Don’t trust agent-generated code
- Concerned about long-term implications
How do you manage a team where people have fundamentally different relationships with AI?
What I’m Learning
1. Velocity ≠ Value
Shipping faster doesn’t mean shipping better. We’re learning to distinguish between “more features” and “better product.”
2. Autonomy Requires Trust, Trust Requires Understanding
Teams only trust agent autonomy when they understand what agents can and can’t do well. That requires experimentation and transparency.
3. Change Management Is Harder Than Technology
The technical implementation is straightforward. Getting 50 engineers with different skill levels and attitudes to adopt new workflows? Much harder.
4. Juniors and Seniors Need Different Approaches
What works for seniors (more autonomy, trust their judgment) doesn’t work for juniors (need more structure, oversight).
Questions I’m Wrestling With
1. How do we balance autonomy with oversight at scale?
- Per-engineer customization? Teams decide their own rules? Company-wide standards?
2. How do we measure what actually matters?
- Not just velocity, but sustainable velocity
- Not just features, but maintainable features
3. How do we manage the senior engineer anxiety?
- Is this just adjustment period or legitimate concern?
- How do we help them see value in an evolving role?
4. What’s the right governance model?
- Too restrictive = lose productivity gains
- Too permissive = accumulate problems we’ll regret
What I’d Love to Hear
If you’re rolling out similar workflows:
- What governance have you found essential?
- How are you handling the culture change?
- Have you figured out how to measure productivity meaningfully?
- How are you maintaining code quality while increasing velocity?
Because honestly, we’re making this up as we go. And I’d love to learn from others navigating the same transition.
This is the future of engineering work, for better or worse. We might as well figure it out together.