Summary
Claude model coding benchmarks and autonomous task horizon over time
Description
Methodology
Benchmark scores and autonomous task duration for each Claude model release. The Opus 4.5 release (Nov 2025) marks the clear inflection point: the first model over 80% on SWE-bench Verified, with an autonomous coding horizon roughly 5x that of earlier Opus models.
Metrics
- SWE-bench Verified — Industry-standard coding benchmark: 500 real GitHub issues, human-validated for solvability. Scores use each vendor's best scaffold unless noted.
- Autonomous Task Horizon — METR's "50% success time horizon": the length of task a model completes successfully ~50% of the time. Doubling time accelerated from 7 months through 2024 to ~89 days by early 2026 (METR Time Horizon 1.1 report, Feb 2026).
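The doubling-time framing above implies a simple exponential model for the 50% time horizon. A minimal sketch of that arithmetic (the function name and the illustrative numbers are mine, not METR's):

```python
def project_horizon(h0_minutes: float, days_elapsed: float, doubling_days: float) -> float:
    """Project a 50%-success time horizon forward under a fixed doubling time."""
    return h0_minutes * 2 ** (days_elapsed / doubling_days)

# With an ~89-day doubling time, a 289-minute (4h49m) horizon
# doubles to 578 minutes after one 89-day period:
print(project_horizon(289, 89, 89))  # 578.0
```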
Key Inflection Points
- Opus 4.5 (Nov 2025) — First model past 80% SWE-bench Verified; 4h49m autonomous horizon. This is where "agentic coding" became viable for real-world software engineering.
- Opus 4.7 (Apr 2026) — Highest public SWE-bench score ever (87.6%); 6h20m autonomous horizon.
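Conversely, any two horizon measurements imply a doubling time. A hedged sketch (the function and the unit conversion are my illustration, not part of the source data):

```python
from math import log2

def implied_doubling_days(h0_min: float, h1_min: float, days_between: float) -> float:
    """Doubling time implied by two 50%-horizon measurements taken some days apart."""
    return days_between / log2(h1_min / h0_min)

# Opus 4.5's 4h49m and Opus 4.7's 6h20m horizons, in minutes:
OPUS_45_MIN = 4 * 60 + 49  # 289
OPUS_47_MIN = 6 * 60 + 20  # 380
```

Note that METR's ~89-day figure is fit across many models and tasks, so a single pair of releases need not match it.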
Notes
- Earlier horizon figures (Claude 3 Opus, 3.5 Sonnet, 3.7 Sonnet) are extrapolated from METR's doubling-curve and are less precise than the Opus 4+ numbers.
- SWE-bench Pro (Scale's harder variant) only became available mid-2025; earlier models have no Pro score.
- Data current as of April 2026.
Primary Sources
- Anthropic — "Introducing Claude Opus 4.5" (Nov 2025) and "Claude Opus 4.7" announcement (Apr 16, 2026)
- Anthropic research — "Claude 3.5 Sonnet SWE-bench" (Oct 2024)
- METR — "Time Horizon 1.1" report (Feb 2026) & Opus 4.5 horizon estimate (Dec 2025)
- SWE-bench Verified leaderboard (swebench.com) & Scale SEAL SWE-Pro leaderboard
- DataCamp model guides (Sonnet 4.5, Opus 4.5, Sonnet 4.6)
- AWS Bedrock release notes (Apr 20, 2026) for Opus 4.7