Summary

Claude model coding benchmarks and autonomous task horizon over time

Description

Methodology

Benchmark scores and autonomous task duration for each Claude model release. The Opus 4.5 release (Nov 2025) marks the clear inflection: the first model to score over 80% on SWE-bench Verified, with an autonomous coding horizon roughly 5x that of earlier Opus models.

Metrics

  • SWE-bench Verified — Industry-standard coding benchmark: 500 real GitHub issues, human-validated for solvability. Scores use each vendor's best scaffold unless noted.
  • Autonomous Task Horizon — METR's "50% success time horizon": the length of task a model completes successfully ~50% of the time. The doubling time accelerated from ~7 months through 2024 to ~89 days by early 2026 (METR Time Horizon 1.1 report, Feb 2026).
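The doubling-time framing above implies simple exponential growth: a horizon h0 measured at date t0 grows by a factor of 2 every doubling period. A minimal sketch of that arithmetic, assuming a constant doubling time (the function name, reference date, and starting values below are illustrative assumptions, not METR's published fit):

```python
from datetime import date

def horizon_hours(t: date, t0: date, h0_hours: float, doubling_days: float) -> float:
    """Extrapolate a 50%-success time horizon under pure exponential doubling.

    h0_hours  -- horizon measured at reference date t0
    doubling_days -- assumed constant doubling time
    """
    return h0_hours * 2 ** ((t - t0).days / doubling_days)

# E.g., starting from a ~4h49m horizon in late Nov 2025 with an ~89-day
# doubling time (both figures quoted above, dates assumed):
h = horizon_hours(date(2026, 4, 1), date(2025, 11, 24), 4 + 49 / 60, 89.0)
```

Note this pure-doubling curve is only an approximation; the measured Opus 4.7 horizon (6h20m) sits below what a constant 89-day doubling from Opus 4.5 would predict, which is why the extrapolated pre-Opus-4 figures in the Notes carry wider error bars.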

Key Inflection Points

  • Opus 4.5 (Nov 2025) — First model past 80% SWE-bench Verified; 4h49m autonomous horizon. This is where "agentic coding" became viable for real-world software engineering.
  • Opus 4.7 (Apr 2026) — Highest public SWE-bench score ever (87.6%); 6h20m autonomous horizon.

Notes

  • Earlier horizon figures (Claude 3 Opus, 3.5 Sonnet, 3.7 Sonnet) are extrapolated from METR's doubling-curve and are less precise than the Opus 4+ numbers.
  • SWE-bench Pro (Scale's harder variant) only became available mid-2025; earlier models have no Pro score.
  • Data current as of April 2026.

Primary Sources

  • Anthropic — "Introducing Claude Opus 4.5" (Nov 2025) and "Claude Opus 4.7" announcement (Apr 16, 2026)
  • Anthropic research — "Claude 3.5 Sonnet SWE-bench" (Oct 2024)
  • METR — "Time Horizon 1.1" report (Feb 2026) & Opus 4.5 horizon estimate (Dec 2025)
  • SWE-bench Verified leaderboard (swebench.com) & Scale SEAL SWE-Pro leaderboard
  • DataCamp model guides (Sonnet 4.5, Opus 4.5, Sonnet 4.6)
  • AWS Bedrock release notes (Apr 20, 2026) for Opus 4.7