CuPod

Humanity’s Last Exam for Artificial Intelligence

Well, well! I was unaware that something like this existed until I stumbled across a post on LinkedIn recently. Naturally, the name caught my attention. I mean, something called Humanity’s Last Exam (HLE) could easily sound like a supernatural, hyperphilosophical, doomsday-evaluation kind of thing, right? Instead, it’s a test designed to track how far frontier AI models — especially large language models (LLMs) — have come. HLE is a benchmark of roughly 2,500 questions spanning 50+ subjects such as chemistry, biology, math, the social sciences, engineering, and computer science. Each question represents frontier domain knowledge in its respective field. In other words, you can’t ace it with guesswork; you would need solid subject-matter expertise.

The Center for AI Safety and Scale AI, two San Francisco-based organizations, organized a global collaboration for this. Over 1,000 contributors from 500+ institutions across 50 countries came together to create a question set “at the frontier of human knowledge” to push AI to its limits in academic reasoning. They started with 70,000 trial questions, testing them against top-tier LLMs like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro. Only the questions the models failed, about 13,000 of them, made it to two rounds of human review. Reviewers had strict instructions: questions had to be PhD-level or higher, original (not drawn from common sources), and precise, with an objectively correct, unambiguous answer.

Interestingly, reviewers weren’t required to verify a question’s correctness if solving it took more than five minutes.

Researchers from FutureHouse flagged this as an issue: in chemistry and biology, about 30% of questions conflicted with peer-reviewed literature. They’ve since released an “HLE Bio/Chem Gold” dataset to address this.

HLE has two question formats: exact-match questions, where the model must produce the exact word or phrase that answers the question, and multiple-choice questions. Because answers are closed-ended, the authors used OpenAI’s o3-mini, a reasoning-oriented LLM, to judge the correctness of model predictions. o3-mini is designed to “think” step by step and justify its answers; here, it compared model predictions to the correct answers and reasoned about whether they matched.
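The judging step above can be sketched as a small grading loop. This is only an illustration: the prompt wording, the `judge_model` callable, and the stub judge are all assumptions for the sketch, not HLE’s actual judge prompt or the real o3-mini API call.

```python
# Minimal sketch of LLM-as-judge grading, assuming a judge_model
# callable that takes a prompt string and returns a yes/no verdict.

def build_judge_prompt(question: str, correct: str, predicted: str) -> str:
    """Compose a prompt asking the judge model to compare two answers."""
    return (
        f"Question: {question}\n"
        f"Reference answer: {correct}\n"
        f"Model answer: {predicted}\n"
        "Do the answers match in meaning? Reply 'yes' or 'no'."
    )

def grade(question: str, correct: str, predicted: str, judge_model) -> bool:
    """Return True if the judge model says the prediction matches."""
    verdict = judge_model(build_judge_prompt(question, correct, predicted))
    return verdict.strip().lower().startswith("yes")

# Stub judge for demonstration only: plain string equality stands in
# for the reasoning model's semantic comparison.
def stub_judge(prompt: str) -> str:
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    return "yes" if fields["Reference answer"] == fields["Model answer"] else "no"

print(grade("2+2?", "4", "4", stub_judge))  # True
print(grade("2+2?", "4", "5", stub_judge))  # False
```

In the real pipeline the stub would be replaced by an API call to the judge model; the point is that a second model, not string matching, decides correctness.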

Here’s the interesting part: all shortlisted questions were ones the frontier models initially couldn’t answer, right? And you can’t ace it with guesswork either. Yet when researchers tested the final dataset, models still scored above zero, though usually under 10% accuracy. As the authors note, “models can inconsistently guess the right answer or guess worse than random chance for multiple choice questions”. In other words, this low but non-zero accuracy comes down to random guessing.

But progress has been fast. In August 2025 — less than six months after the dataset’s release — several models had already crossed the 20% accuracy mark. GPT-5 now leads the leaderboard at around 26.5%. If this trend holds, models may reach “satisfactory” performance on HLE by year’s end. And just like older benchmarks such as MMLU or MATH, which once stumped LLMs but now see scores over 60–70%, HLE may soon lose its edge as a challenge.

At this pace, humans might soon run out of ways to challenge AI based solely on trained knowledge.

And maybe that was always the point. Throughout history, whenever a new technology has emerged, human curiosity has pushed it to its limits until its purpose was fulfilled. That’s the way of human civilization. The only unique thing here is that, this time, we are probably trying to outwit ourselves. Recent forecasts suggest that many of our livelihoods might be jeopardized by the rise of AI, and some of these predictions have already started to come true. However, technology has always propelled civilization forward or helped it take a leap to the next big thing.

Here, maybe we’re at the juncture of the latter.

For Further Reading:

  1. Humanity’s Last Exam (https://arxiv.org/pdf/2501.14249)

