Summary
Claude model coding benchmarks and autonomous task horizon over time
Description
Methodology
Benchmark scores and autonomous task duration for each Claude model release. The Opus 4.5 release (Nov 2025) marks the clear inflection point: the first model over 80% on SWE-bench Verified, with an autonomous coding horizon roughly 5x that of earlier Opus models.
Metrics
- SWE-bench Verified — Industry-standard coding benchmark: 500 real GitHub issues, human-validated for solvability. Scores use each vendor's best scaffold unless noted.
- Autonomous Task Horizon — METR's "50% success time horizon": the length of task a model completes successfully ~50% of the time. Doubling time accelerated from 7 months through 2024 to ~89 days by early 2026 (METR Time Horizon 1.1 report, Feb 2026).
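The doubling-time framing above implies a simple exponential model for the 50% time horizon. A minimal sketch of that arithmetic (the function name and the illustrative numbers are mine, not METR's):

```python
def project_horizon(h0_minutes: float, days_elapsed: float, doubling_days: float) -> float:
    """Project a 50%-success time horizon forward under a fixed doubling time."""
    return h0_minutes * 2 ** (days_elapsed / doubling_days)

# With an ~89-day doubling time, a 289-minute (4h49m) horizon
# doubles to 578 minutes after one 89-day period:
print(project_horizon(289, 89, 89))  # 578.0
```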
Key Inflection Points
- Opus 4.5 (Nov 2025) — First model past 80% SWE-bench Verified; 4h49m autonomous horizon. This is where "agentic coding" became viable for real-world software engineering.
- Opus 4.7 (Apr 2026) — Highest public SWE-bench score ever (87.6%); 6h20m autonomous horizon.
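Conversely, any two horizon measurements imply a doubling time. A hedged sketch (the function and the unit conversion are my illustration, not part of the source data):

```python
from math import log2

def implied_doubling_days(h0_min: float, h1_min: float, days_between: float) -> float:
    """Doubling time implied by two 50%-horizon measurements taken some days apart."""
    return days_between / log2(h1_min / h0_min)

# Opus 4.5's 4h49m and Opus 4.7's 6h20m horizons, in minutes:
OPUS_45_MIN = 4 * 60 + 49  # 289
OPUS_47_MIN = 6 * 60 + 20  # 380
```

Note that METR's ~89-day figure is fit across many models and tasks, so a single pair of releases need not match it.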
Notes
- Earlier horizon figures (Claude 3 Opus, 3.5 Sonnet, 3.7 Sonnet) are extrapolated from METR's doubling-curve and are less precise than the Opus 4+ numbers.
- SWE-bench Pro (Scale's harder variant) only became available mid-2025; earlier models have no Pro score.
- Data current as of April 2026.
Primary Sources
- Anthropic — "Introducing Claude Opus 4.5" (Nov 2025) and "Claude Opus 4.7" announcement (Apr 16, 2026)
- Anthropic research — "Claude 3.5 Sonnet SWE-bench" (Oct 2024)
- METR — "Time Horizon 1.1" report (Feb 2026) & Opus 4.5 horizon estimate (Dec 2025)
- SWE-bench Verified leaderboard (swebench.com) & Scale SEAL SWE-Pro leaderboard
- DataCamp model guides (Sonnet 4.5, Opus 4.5, Sonnet 4.6)
- AWS Bedrock release notes (Apr 20, 2026) for Opus 4.7