Math Benchmark Test - Search News

Hosted on MSN

AI is actually bad at math, ORCA shows

ORCA benchmark trips up ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 In the world of George Orwell's 1984, two and two make five. And large language models are not much ...

PC Gamer

A new math benchmark just dropped and leading AI models can solve 'less than 2%' of its problems... oh dear

Sometimes I forget there's a whole other world out there where AI models aren't just used for basic tasks such as simple research and quick content summaries. Out in the land of bigwigs, they're ...

Morning Overview on MSN

AI is cracking "impossible" math. Can it beat top humans?

Artificial intelligence has moved from checking homework to attacking problems that professional mathematicians once treated as out of reach. Systems tuned for symbolic reasoning are now cracking long ...

eWeek

FrontierMath Benchmark Exposes AI Struggles in Advanced Math

eSpeaks’ Corey Noles talks with Rob Israch, President of Tipalti, about what it means to lead with Global-First Finance and how companies can build scalable, compliant operations in an increasingly ...

OfficeChai

OpenAI Releases GPT 5.2, Beats Google Gemini 3 Pro On Several Benchmarks

OpenAI had been stung by Google’s release of Gemini 3 Pro which had eclipsed it on most benchmarks, but it’s thrown a ...

Ars Technica

New secret math benchmark stumps AI models and PhDs alike

I think of an AI as a script kiddie. A very good script kiddie, but never the less a basic script kiddie, If it hasnt seen the script for the answer, then it can't give the answer. In other words, an ...

VentureBeat

Microsoft’s GRIN-MoE AI model takes on coding and math, beating competitors in key benchmarks

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Microsoft has unveiled a groundbreaking artificial intelligence model, ...

Yahoo

Why most AI benchmarks tell us so little

On that last question, not likely. The reason -- or rather, the problem -- lies with the benchmarks AI companies use to quantify a model's strengths -- and weaknesses. Jesse Dodge, a scientist at the ...

14d

Databricks' OfficeQA uncovers disconnect: AI agents ace abstract tests but stall at 45% on enterprise docs

The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results