By Michael A.G. · Mar 7, 2026

Global Researchers Unveil "Humanity's Last Exam" to Gauge Advanced AI Capabilities

An international team of researchers has introduced a new benchmark, "Humanity's Last Exam" (HLE), designed to test the limits of current artificial intelligence systems. The assessment comprises 2,500 expert-level questions spanning a wide range of disciplines, from mathematics and the natural sciences to ancient languages and the humanities. The initiative comes as traditional AI evaluation methods become obsolete: advanced models now easily pass older tests, which makes it difficult to measure their capabilities accurately. Details of the project and its initial findings are published in a recent study in the journal *Nature*.

Initial results from HLE indicate that even the most sophisticated AI models struggle with the new exam. Earlier models such as GPT-4o and Claude 3.5 Sonnet scored below 5%, while OpenAI's o1 model reached approximately 8% accuracy. More recent systems, including Gemini 3.1 Pro and Claude Opus 4.6, improved on this, scoring between 40% and 50%. These results point to a considerable gap between current AI capabilities and the deep, contextual understanding required to master expert-level human knowledge across diverse fields.

The creation of Humanity's Last Exam involved nearly 1,000 researchers, each contributing specialized questions from their respective fields. The collaborative effort aims to provide a more robust and challenging metric for AI development, one that moves beyond simple pattern recognition to assess deeper intelligence, contextual understanding, and specialized expertise. HLE is intended as a tool for understanding the limitations and future directions of artificial intelligence as it continues to evolve.