Measuring Agent Capabilities and Anthropic’s RSP
Anthropic’s History
- Founded: 2021 as a Public Benefit Corporation (PBC).
- Milestones:
  - 2022: Claude 1 completed internally, but not yet released.
  - 2023: Claude 1 released publicly; Claude 2 launched.
  - 2024: Claude 3 launched.
  - 2025: Continued advances in interpretability and AI safety, including:
    - A mathematical framework for constitutional AI.
    - Work on sleeper agents and toy models of superposition.
Responsible Scaling Policy (RSP)
- Definition: A framework to ensure the safe scaling of AI capabilities.
- Goals:
  - Provide structure for safety decisions.
  - Ensure public accountability.
  - Iterate on safety decisions as capabilities evolve.
  - Serve as a template for policymakers.
- AI Safety Levels (ASL): Modeled after the biosafety levels (BSL) used for handling dangerous biological materials; they align safety, security, and operational standards with a model's potential for catastrophic risk.
  - ASL-1 (smaller models): No meaningful catastrophic risk (e.g., 2018-era LLMs, chess-playing AIs).
  - ASL-2 (present large models): Early signs of dangerous capabilities (e.g., instructions for bioweapons, though with limited reliability).
  - ASL-3 (higher-risk models): Significant potential for catastrophic misuse, or low-level autonomy.
  - ASL-4 and higher (speculative): Future systems involving qualitative escalations in catastrophic risk or autonomy.
- Implementation:
  - Safety challenges and methods.
  - Case study: computer use.
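The ASL scheme above implies a gating rule: a model may only be deployed if the safeguards in place meet or exceed the safety level its evaluated capabilities require. A minimal sketch of that idea in Python; the tier names come from the policy above, but the evaluation flags and thresholds here are invented for illustration, not Anthropic's actual criteria:

```python
# Hedged sketch of ASL-based deployment gating.
# The ASL tiers mirror the RSP's BSL-inspired scheme; the specific
# capability-evaluation flags below are hypothetical examples.
from enum import IntEnum

class ASL(IntEnum):
    ASL1 = 1  # smaller models: no meaningful catastrophic risk
    ASL2 = 2  # present large models: early signs of dangerous capabilities
    ASL3 = 3  # significant misuse potential or low-level autonomy
    ASL4 = 4  # speculative: qualitative escalation in risk or autonomy

def required_asl(eval_results: dict[str, bool]) -> ASL:
    """Map (hypothetical) dangerous-capability eval results to a required ASL."""
    if eval_results.get("autonomous_replication") or eval_results.get("serious_misuse_uplift"):
        return ASL.ASL3
    if eval_results.get("early_dangerous_capabilities"):
        return ASL.ASL2
    return ASL.ASL1

def may_deploy(eval_results: dict[str, bool], safeguards_level: ASL) -> bool:
    """Deploy only if implemented safeguards meet or exceed the required ASL."""
    return safeguards_level >= required_asl(eval_results)
```

For example, a model showing only early dangerous capabilities clears ASL-2 safeguards (`may_deploy({"early_dangerous_capabilities": True}, ASL.ASL2)` is `True`), while one showing serious misuse uplift would not.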
Measuring Capabilities
- Challenges: Benchmarks quickly become obsolete as capabilities improve.
- Examples:
  - Task completion time relative to humans: Claude 3.5 completes in seconds tasks that take human developers roughly 30 minutes.
- Benchmarks:
  - SWE-bench: Assesses real-world software engineering tasks.
  - Aider's benchmarks: Code editing and refactoring.
- Results:
  - Claude 3.5 Sonnet outperforms OpenAI o1 across key benchmarks.
  - Faster and cheaper: $3 per million input tokens vs. $15 per million for OpenAI o1.
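The pricing gap above compounds at scale. A quick back-of-the-envelope comparison using the quoted per-million-token input rates; the workload size is a made-up example, not a figure from the talk:

```python
# Input-cost comparison at the quoted rates ($/million input tokens).
# The request/workload sizes below are hypothetical illustrations.
PRICE_PER_MTOK = {
    "claude-3.5-sonnet": 3.00,   # $3 per million input tokens (quoted above)
    "openai-o1": 15.00,          # $15 per million input tokens (quoted above)
}

def input_cost(model: str, input_tokens: int) -> float:
    """Dollar cost of processing the given number of input tokens once."""
    return PRICE_PER_MTOK[model] * input_tokens / 1_000_000

# Example workload: a 50k-token context, 1,000 runs.
tokens = 50_000 * 1_000
for model, _ in PRICE_PER_MTOK.items():
    print(f"{model}: ${input_cost(model, tokens):,.2f}")
# → claude-3.5-sonnet: $150.00
# → openai-o1: $750.00
```

At these rates the input-side cost ratio is a flat 5x regardless of workload; output-token pricing (not quoted here) would shift the totals.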
Claude 3.5 Sonnet Highlights
- Agentic Coding and Game Development: Designed for efficiency and accuracy in real-world scenarios.
- Computer Use Demos:
  - Coding: Demonstrated advanced code generation and integration.
  - Operations: Showcased operational tasks with safety considerations.
AI Safety Measures
- Focus Areas:
  - Scaling governance.
  - Capability measurement.
  - Collaboration with academia.
- Practical Safety:
  - ASL standard implementation.
  - Deployment safeguards.
  - Lessons learned in year one.
Future Directions
- Scaling and governance improvements.
- Enhanced benchmarks and academic partnerships.
- Addressing interpretability and sleeper agent risks.