

Measuring AI Agent Autonomy in Production: What the Data Actually Shows

· 7 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks on pre-deployment evals and almost nothing on measuring what their agents actually do in production. That's backwards. The metrics that matter—how long agents run unsupervised, how often they ask for help, how much risk they take on—only emerge at runtime, across thousands of real sessions. Without measuring these, you're flying blind.

A large-scale study of production agent behavior across thousands of deployments and software engineering sessions has surfaced some genuinely counterintuitive findings. The picture that emerges is not the one most builders expect.

Agents for Software Development

· 2 min read

Software’s Impact

  • Software is transforming industries, as Marc Andreessen predicted in 2011 ("software is eating the world").
  • Enabling everyone to write software to achieve their goals could multiply that impact.

Software Development Workflow

  • Time allocation:
    • 17% Coding
    • 36% Bugfixing
    • 10% Testing
    • 8% Documentation/Reviews
    • 14% Communication
    • 15% Other tasks

Development Tools

  • Copilots:
    • Synchronous support for writing code (e.g., GitHub Copilot).
  • Development Agents:
    • Autonomous tools for coding (e.g., SWE-Agent, Aider) and broader tasks (e.g., Devin, OpenHands).

Challenges in Coding Agents

  • Defining the environment.
  • Designing observation/action spaces.
  • File localization and code generation.
  • Planning, error recovery, and ensuring safety.

Software Development Environments

  • Actual Environments:
    • Source repositories, task management software, office tools, communication tools.
  • Testing Environments:
    • Focused on coding, sometimes includes browsing tasks.

Metrics and Datasets

  • Pass@K (Chen et al., 2021): The probability that at least one of K generated samples passes the unit tests.
  • Semantic Overlap Metrics:
    • BLEU, CodeBLEU, CodeBERTScore.
  • Key Datasets:
    • HumanEval, ARCADE, SWE-bench, Design2Code.
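Pass@K has a closed-form unbiased estimator in the Chen et al. (2021) paper: given n samples per problem of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch of that formula:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: samples that pass the unit tests
    k: evaluation budget
    """
    if n - c < k:
        # Fewer failing samples than the budget: some passing sample
        # is guaranteed to appear in any size-k draw.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem scores are then averaged across the benchmark to report a single pass@k number.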

Solutions for File Localization

  1. User Input: Relies on experienced users to specify files.
  2. Search Tools: Integrated search capabilities (e.g., SWE-Agent).
  3. Repository Mapping: Prebuilt maps (e.g., Aider repomap).
  4. Retrieval-Augmented Generation: Retrieve relevant code snippets and feed them to the LM as context.
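The retrieval-based approaches above share one shape: score every file in the repository against the task description and surface the top candidates. A minimal sketch, using plain lexical overlap as a stand-in for the embedding or BM25 retrievers real systems use (the function names `score_file` and `localize` are illustrative, not from any of the tools named above):

```python
import re
from collections import Counter

def score_file(query: str, text: str) -> int:
    # Bag-of-words overlap between the task description and the file.
    # Real agents substitute dense embeddings or BM25 here.
    q = Counter(re.findall(r"\w+", query.lower()))
    t = Counter(re.findall(r"\w+", text.lower()))
    return sum(min(q[w], t[w]) for w in q)

def localize(query: str, repo: dict[str, str], k: int = 3) -> list[str]:
    # Rank repository files by relevance to the issue; return the top k
    # as candidate edit locations for the agent.
    ranked = sorted(repo, key=lambda path: score_file(query, repo[path]), reverse=True)
    return ranked[:k]
```

The same ranking skeleton covers options 2-4: search tools compute the score on demand, repository maps precompute the per-file summaries, and RAG feeds the top-k files into the LM's context.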

Planning and Recovery

  • Hard-coded Processes: Predefined steps for file localization, patch generation, etc.
  • LLM-Generated Plans: Use LMs for planning and execution (e.g., CodeR).
  • Revisiting Errors: Automated fixes based on error messages (e.g., InterCode).
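The error-revisiting pattern is essentially a generate-execute-refine loop: run the candidate code, and if it fails, feed the error message back as context for the next attempt. A minimal sketch, where `generate` is a hypothetical callable standing in for an LM call (not the actual InterCode or CodeR interface):

```python
import subprocess
import sys

def run_with_retries(generate, task: str, max_attempts: int = 3):
    """Generate code for `task`, executing each attempt and feeding
    stderr back into the next generation. `generate(task, feedback)`
    is a hypothetical LM wrapper returning a code string."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate(task, feedback)
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=30,
        )
        if proc.returncode == 0:
            return code  # success: accept this candidate
        feedback = proc.stderr  # error message guides the next attempt
    return None  # all attempts failed
```

Hard-coded pipelines fix the sequence of steps and only regenerate the failing one, while LLM-generated plans let the model decide what to retry.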

Safety Measures

  1. Sandboxing: Limit execution environments (e.g., Docker).
  2. Credentialing: Principle of least privilege.
  3. Post-hoc Auditing: Security analysis using LMs and other tools.
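At the lightest end of the sandboxing spectrum, agent-generated code can at least be run in a separate process with an isolated interpreter, a scratch working directory, and a wall-clock limit. A minimal sketch of that idea (production systems layer container isolation such as Docker on top; `run_sandboxed` is an illustrative name, not an API from any tool above):

```python
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: int = 5) -> subprocess.CompletedProcess:
    """Execute untrusted code with minimal isolation:
    a fresh process, a throwaway working directory, and a time limit."""
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            # -I puts Python in isolated mode: no user site-packages,
            # no PYTHON* environment variables, no cwd on sys.path.
            [sys.executable, "-I", "-c", code],
            cwd=scratch,          # writes land in the scratch dir
            capture_output=True,  # keep output for post-hoc auditing
            text=True,
            timeout=timeout,      # bound runaway executions
        )
```

This provides none of the filesystem or network guarantees of a container, which is why Docker-style sandboxing plus least-privilege credentials remain the recommended baseline.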

Future Directions

  • Enhance agentic training methods.
  • Expand human-in-the-loop approaches.
  • Address broader software tasks beyond coding.

Resources

  • OpenHands Repository: GitHub