TechTakes@awful.systems · English · 1 month ago

LLMs average <5% on 2025 Math Olympiad; award each other 20x points

arxiv.org


slop_as_a_service@awful.systems to TechTakes@awful.systems · English · 1 month ago

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad


Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

“Notably, O3-MINI, despite being one of the best reasoning models, frequently skipped essential proof steps by labeling them as ‘trivial’, even when their validity was crucial.”


froztbyte@awful.systems
1 month ago
they gotta make up for all those scary cave-wall pictures somehow
