Skip to main content

2 posts tagged with "safety"

View all tags

Safe & Trustworthy AI Agents and Evidence-Based AI Policy

· 2 min read

Key Topics

  • Exponential growth in LLMs and their capabilities.
  • Broad spectrum of risks associated with AI systems.
  • Challenges in ensuring trustworthiness, privacy, and alignment of AI.
  • Importance of science- and evidence-based AI policy.

Broad Spectrum of AI Risks

  • Misuse/Malicious Use: Scams, misinformation, bioweapons, cyber-attacks.
  • Malfunction: Bias, harm from system errors, loss of control.
  • Systemic Risks: Privacy, labor market impact, environmental concerns.

AI Safety vs. AI Security

  • AI Safety: Prevent harm caused by AI systems.
  • AI Security: Protect AI systems from external threats.
  • Adversarial Settings: Safety mechanisms must withstand attacks.

Trustworthiness Problems in AI

  • Robustness: Safe, effective systems, including adversarial and out-of-distribution robustness.
  • Fairness: Prevent algorithmic discrimination.
  • Data Privacy: Prevent extraction of sensitive data.
  • Alignment Goals: Ensure AI systems are helpful, harmless, and honest.

Training Data Privacy Risks

  • Memorization: Extracting sensitive data (e.g., social security numbers) from LLMs.
  • Attacks: Training data extraction, prompt leakage, and indirect prompt injection.
  • Defenses: Differential privacy, deduplication, and robust training techniques.

Adversarial Attacks and Defenses

  • Attacks:
    • Prompt injection, data poisoning, jailbreaks.
    • Adversarial examples in both virtual and physical settings.
    • Exploiting vulnerabilities in AI systems.
  • Defenses:
    • Prompt-level defenses (e.g., re-design prompts, detect anomalies).
    • System-level defenses (e.g., information flow control).
    • Secure-by-design systems with formal verification.

Safe-by-Design Systems

  • Proactive Defense: Architecting provably secure systems.
  • Challenges: Difficult to apply to non-symbolic components like neural networks.
  • Future Systems: Hybrid symbolic and non-symbolic systems.

AI Policy Recommendations

Key Priorities:

  1. Better Understanding of AI Risks:

    • Comprehensive analysis of misuse, malfunction, and systemic risks.
    • Marginal risk framework to evaluate societal impacts of AI.
  2. Increase Transparency:

    • Standardized reporting for AI design and development.
    • Examples: Digital Services Act, US Executive Order.
  3. Develop Early Detection Mechanisms:

    • In-lab testing for adversarial scenarios.
    • Post-deployment monitoring (e.g., adverse event reporting).
  4. Mitigation and Defense:

    • New approaches for safe AI.
    • Strengthen societal resilience against misuse.
  5. Build Trust and Reduce Fragmentation:

    • Collaborative research and international cooperation.

Call to Action

  • Blueprint for Future AI Policy:
    • Taxonomy of risk vectors and policy interventions.
    • Conditional responses to societal risks.
  • Multi-Stakeholder Collaboration:
    • Advance scientific understanding and evidence-based policies.

Resource: Understanding-ai-safety.org

Measuring Agent Capabilities and Anthropic’s RSP

· 2 min read

Anthropic’s History

  • Founded: 2021 as a Public Benefit Corporation (PBC).
  • Milestones:
    • 2022: Claude 1 completed.
    • 2023: Claude 1 released, Claude 2 launched.
    • 2024: Claude 3 launched.
    • 2025: Advances in interpretability and AI safety:
      • Mathematical framework for constitutional AI.
      • Sleeper agents and toy models of superposition.

Responsible Scaling Policy (RSP)

  • Definition: A framework to ensure safe scaling of AI capabilities.
  • Goals:
    • Provide structure for safety decisions.
    • Ensure public accountability.
    • Iterate on safe decisions.
    • Serve as a template for policymakers.
  • AI Safety Levels (ASL): Modeled after biosafety levels (BSL) for handling dangerous biological materials, aligning safety, security, and operational standards with a model’s catastrophic risk potential.
    • ASL-1: Smaller Models: No meaningful catastrophic risk (e.g., 2018 LLMs, chess-playing AIs).
    • ASL-2: Present Large Models: Early signs of dangerous capabilities (e.g., instructions for bioweapons with limited reliability).
    • ASL-3: Higher Risk Models: Models with significant catastrophic misuse potential or low-level autonomy.
    • ASL-4 and higher: Speculative Models: Future systems involving qualitative escalations in catastrophic risk or autonomy.
  • Implementation:
    • Safety challenges and methods.
    • Case study: computer use.

Measuring Capabilities

  • Challenges: Benchmarks become obsolete.
  • Examples:
    • Task completion time relative to humans: Claude 3.5 completes tasks in seconds compared to human developers’ 30 minutes.
    • Benchmarks:
      • SWE-bench: Assesses real-world software engineering tasks.
      • Aider’s benchmarks: Code editing and refactoring.
  • Results:
    • Claude 3.5 Sonnet outperforms OpenAI o1 across key benchmarks.
    • Faster and cheaper: $3/Mtok input vs. OpenAI o1 at $15/Mtok input.

Claude 3.5 Sonnet Highlights

  • Agentic Coding and Game Development: Designed for efficiency and accuracy in real-world scenarios.
  • Computer Use Demos:
    • Coding: Demonstrated advanced code generation and integration.
    • Operations: Showcased operational tasks with safety considerations.

AI Safety Measures

  • Focus Areas:
    • Scaling governance.
    • Capability measurement.
    • Collaboration with academia.
  • Practical Safety:
    • ASL standard implementation.
    • Deployment safeguards.
    • Lessons learned in year one.

Future Directions

  • Scaling and governance improvements.
  • Enhanced benchmarks and academic partnerships.
  • Addressing interpretability and sleeper agent risks.