Math Benchmark Test - Search News

AI scores a ‘C–’ on its hardest math test yet

The second batch of “First Proof” problems is meant to evaluate AI’s usefulness for research-level math. The best model got ...

Ars Technica

New secret math benchmark stumps AI models and PhDs alike

On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that ...

Hosted on MSN

AI is actually bad at math, ORCA shows

ORCA benchmark trips up ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 In the world of George Orwell's 1984, two and two make five. And large language models are not much ...

Tom's Guide

AI models are getting better at grade school math — but a new study suggests they may be cheating

For the fastest way to join Tom's Guide Club enter your email below. We'll send you a confirmation and sign you up to our newsletter to keep you updated on all the latest news.

VentureBeat

LiveBench is an open LLM benchmark that uses contamination-free test data and objective scoring

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now A team of Abacus.AI, New York University, ...

Geeky Gadgets

Al Benchmarks Investigated : Do Companies Tune Private Builds for Leaderboards, Then Ship Weaker Versions?

Are AI benchmarks really the gold standard we’ve been led to believe? Matt Wolfe walks through how these widely accepted metrics, designed to measure the performance of artificial intelligence systems ...

Ars Technica

New secret math benchmark stumps AI models and PhDs alike

I think of an AI as a script kiddie. A very good script kiddie, but never the less a basic script kiddie, If it hasnt seen the script for the answer, then it can't give the answer. In other words, an ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results