Measuring Agent Capabilities and Anthropic’s RSP

January 26, 2025 · 2 min read

Founded: Anthropic was founded in 2021 as a Public Benefit Corporation (PBC).
Milestones:
- 2022: Claude 1 completed.
- 2023: Claude 1 released, Claude 2 launched.
- 2024: Claude 3 launched.
- 2025: Advances in interpretability and AI safety:
  - Mathematical framework for constitutional AI.
  - Sleeper agents and toy models of superposition.

Definition: The Responsible Scaling Policy (RSP) is a framework for scaling AI capabilities safely.
Goals:
- Provide structure for safety decisions.
- Ensure public accountability.
- Iterate on safety decisions.
- Serve as a template for policymakers.
AI Safety Levels (ASL): Modeled after biosafety levels (BSL) for handling dangerous biological materials, aligning safety, security, and operational standards with a model’s catastrophic risk potential.
- ASL-1 (smaller models): No meaningful catastrophic risk (e.g., 2018 LLMs, chess-playing AIs).
- ASL-2 (present large models): Early signs of dangerous capabilities (e.g., instructions for bioweapons with limited reliability).
- ASL-3 (higher-risk models): Significant catastrophic misuse potential or low-level autonomy.
- ASL-4 and higher (speculative models): Future systems involving qualitative escalations in catastrophic risk or autonomy.
Implementation:
- Safety challenges and methods.
- Case study: computer use.
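
The ASL tiers listed above form an ordered scale, so one way to picture them is as an ordered enumeration. The following is only an illustrative sketch: the names `ASL`, `SUMMARY`, and `needs_upgrade` are hypothetical and do not come from the RSP itself, and the real policy relies on detailed capability evaluations rather than a single comparison.

```python
from enum import IntEnum


class ASL(IntEnum):
    """AI Safety Levels, ordered so that a higher value means higher risk."""
    ASL_1 = 1
    ASL_2 = 2
    ASL_3 = 3
    ASL_4_PLUS = 4  # stands in for "ASL-4 and higher"


# One-line summaries taken from the tier list above.
SUMMARY = {
    ASL.ASL_1: "No meaningful catastrophic risk",
    ASL.ASL_2: "Early signs of dangerous capabilities",
    ASL.ASL_3: "Significant catastrophic misuse potential or low-level autonomy",
    ASL.ASL_4_PLUS: "Qualitative escalations in catastrophic risk or autonomy",
}


def needs_upgrade(safeguards: ASL, assessed: ASL) -> bool:
    """Hypothetical check: True when a model's assessed level exceeds
    the level its current safeguards were designed for."""
    return assessed > safeguards


print(needs_upgrade(ASL.ASL_2, ASL.ASL_3))  # True
```

Using an ordered enum keeps the "higher level means stricter standards" idea explicit, which mirrors the biosafety-level analogy.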

Challenges: Benchmarks become obsolete as model capabilities improve.
Examples:
- Task completion time relative to humans: Claude 3.5 completes in seconds tasks that take human developers 30 minutes.
- Benchmarks:
  - SWE-bench: Assesses real-world software engineering tasks.
  - Aider’s benchmarks: Code editing and refactoring.
Results:
- Claude 3.5 Sonnet outperforms OpenAI o1 across key benchmarks.
- Faster and cheaper: $3 per million input tokens (Mtok) vs. $15/Mtok input for OpenAI o1.
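
To make the input-price gap concrete, here is a minimal sketch of the arithmetic. It uses only the two input prices quoted above. The dictionary keys, the function name `input_cost`, and the 2-million-token workload are assumptions chosen for illustration, and output-token pricing is not modeled.

```python
# Input prices in USD per million tokens (Mtok), as quoted above.
PRICE_PER_MTOK_INPUT = {
    "claude-3.5-sonnet": 3.00,
    "openai-o1": 15.00,
}


def input_cost(model: str, tokens: int) -> float:
    """Return the input-token cost in USD for the given token count."""
    return PRICE_PER_MTOK_INPUT[model] * tokens / 1_000_000


# Hypothetical workload: 2 million input tokens.
tokens = 2_000_000
sonnet = input_cost("claude-3.5-sonnet", tokens)
o1 = input_cost("openai-o1", tokens)

print(f"Claude 3.5 Sonnet: ${sonnet:.2f}")  # $6.00
print(f"OpenAI o1:         ${o1:.2f}")      # $30.00
print(f"Ratio: {o1 / sonnet:.1f}x")         # 5.0x
```

At these prices the input cost differs by a factor of five regardless of workload size, since both scale linearly with token count.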

Agentic Coding and Game Development: Designed for efficiency and accuracy in real-world scenarios.
Computer Use Demos:
- Coding: Demonstrated advanced code generation and integration.
- Operations: Showcased operational tasks with safety considerations.

Focus Areas:
- Scaling governance.
- Capability measurement.
- Collaboration with academia.
Practical Safety:
- ASL standard implementation.
- Deployment safeguards.
- Lessons learned in year one.