UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

Standard AI benchmarks underestimate agent capabilities because they cap compute budgets. Raising token budgets tenfold increased success rates on software engineering tasks by 25%, indicating that actual progress is significantly steeper than the benchmarks measure. Newer models gain the most from the expanded compute.

The Decoder · Jul 3, 2026
