AI benchmarks underestimate agent capabilities because they cap compute budgets. Increasing token budgets tenfold raised success rates by 25% on software engineering tasks, indicating that actual progress is significantly steeper than measured. Newer models gain the most from expanded compute.