An awesome list of scientific benchmarks for LLMs. It curates resources for evaluating AI systems on complex scientific reasoning tasks across multiple disciplines, with an emphasis on accuracy-first evaluation. The repository is written in TypeScript and has 10 stars on GitHub, an early sign of community interest in AI for science.