IMO 2025 has concluded in Australia, and evaluations show that current AI models still face significant limitations in rigorous mathematical reasoning.
Conclusion: test results show that current AI models still have considerable room for improvement on complex mathematical problems, and that there is a significant gap between producing a correct answer and providing complete mathematical reasoning.
The project "IMO2025-LLM", published by GitHub user PaperPlaneDeemo, is a benchmark and script library for evaluating how large language models (LLMs) perform on the IMO (International Mathematical Olympiad) 2025 competition problems.
🧠 Project background and purpose
- Goal: evaluate the problem-solving ability, reasoning process, and problem-solving cost of current mainstream LLMs (such as Anthropic Claude Sonnet 4, ByteDance Seed 1.6, and Google Gemini 2.5 Pro) on all six IMO 2025 problems;
- Why it matters: IMO-level problems are extremely difficult and challenging, making them a fitting "acid test" of an LLM's mathematical reasoning and proof construction.
Content structure and function
- Problem links: each IMO problem links to its AoPS (Art of Problem Solving) discussion thread, so you can read the statement and build an intuitive understanding of it;
- Evaluation script: `evaluate.py` can load local models or API models for testing; its output includes whether each answer is correct, the number of tokens used, cost estimates, and visual comparison charts;
- Extensibility: any model (including locally deployed open-source models) can be added by configuring its API endpoint or model path in `config.yaml` and re-running the script, which regenerates the results and charts automatically.
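To make the extensibility point concrete, here is a minimal sketch of config-driven model selection. The field names (`type`, `endpoint`, `path`) and the entries are purely illustrative assumptions, not the project's actual `config.yaml` schema:

```python
# Hypothetical sketch of config-driven model selection, in the spirit of the
# project's config.yaml. Field names and entries are illustrative only.
MODELS = {
    "gemini-2.5-pro": {"type": "api", "endpoint": "https://example.com/v1"},
    "my-local-model": {"type": "local", "path": "./weights/my-local-model"},
}

def resolve_model(name: str) -> dict:
    """Look up a model entry and return it tagged with its name."""
    cfg = MODELS.get(name)
    if cfg is None:
        raise KeyError(f"model '{name}' not found in config")
    return {"name": name, **cfg}

print(resolve_model("my-local-model")["type"])  # local
```

The idea is that adding a new model is a pure configuration change: the evaluation loop never needs to know in advance which models exist.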
Key assessment data
| Model | Questions solved | Total tokens | Estimated cost |
|---|---|---|---|
| Claude Sonnet 4 | 2/6 (Questions 1, 3) | ~235k | $3.50 |
| Gemini 2.5 Pro | 2/6 (Questions 1, 5) | ~184k | $1.84 |
| Seed 1.6 | 2/6 (Questions 3, 5) | ~104k | $0.21 |
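A quick way to compare the rows above is cost per solved question, computed directly from the article's reported figures (no new measurements):

```python
# Cost per solved question, derived from the table above.
results = {
    "Claude Sonnet 4": {"solved": 2, "tokens": 235_000, "cost_usd": 3.50},
    "Gemini 2.5 Pro":  {"solved": 2, "tokens": 184_000, "cost_usd": 1.84},
    "Seed 1.6":        {"solved": 2, "tokens": 104_000, "cost_usd": 0.21},
}

for name, r in results.items():
    per_solve = r["cost_usd"] / r["solved"]
    print(f"{name}: ${per_solve:.2f} per solved question")
```

On these numbers Seed 1.6 is by far the cheapest per solved question, at roughly $0.11 versus Claude's $1.75.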
- Two models (Seed 1.6 and Gemini 2.5 Pro) produced complete solutions to Question 5, the only fully complete solutions in the entire evaluation;
- Seed 1.6 performs well on accuracy and reasoning quality while being remarkably cheap: roughly 6% of Claude Sonnet 4's cost ($0.21 vs. $3.50).
Conclusion and significance
- IMO problems remain a hard frontier for LLM reasoning capabilities;
- The project provides an open, reproducible, and extensible framework, encouraging the community to keep evaluating more models;
- Question 5 is regarded as a new "acid test" of logical rigour and creative reasoning;
- The project is MIT-licensed and includes the problems, model outputs, and evaluation data, making it suitable for research, teaching, product development, and similar scenarios.
How to use this project
If you want to try it yourself or analyze other models, follow these steps:
- Clone the repository:

  ```
  git clone https://github.com/PaperPlaneDeemo/IMO2025-LLM.git
  cd IMO2025-LLM
  ```

- Check the README for each problem's link, the input formats, and the model descriptions;
- Run the evaluation script, for example to test a local model on Question 5:

  ```
  python evaluate.py --model my-local-model --problem 5
  ```

- The script reports whether the solution is right or wrong, the number of tokens used, and cost estimates, and updates the visualization chart.
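To evaluate one model on all six problems, the per-problem command can be scripted. The `--model`/`--problem` flags come from the example above; everything else here (exit codes, output handling) is an assumption, so the sketch only builds the commands rather than asserting anything about the script's behaviour:

```python
# Hedged sketch: build evaluate.py commands for all six IMO 2025 problems.
# Only the --model and --problem flags are taken from the project's README
# example; run behaviour is assumed, so the commands are built but not run.
import subprocess

def build_commands(model: str, problems=range(1, 7)) -> list[list[str]]:
    return [["python", "evaluate.py", "--model", model, "--problem", str(p)]
            for p in problems]

for cmd in build_commands("my-local-model"):
    # In real use: subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```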
Summary
- IMO2025-LLM is an LLM benchmark built specifically around the IMO 2025 math problems;
- It provides evaluation scripts, data logging, cost analysis, and visual charts;
- So far only Seed 1.6 and Gemini 2.5 Pro have completely solved Question 5, and the other questions remain very difficult;
- If you follow LLMs' progress in higher-order mathematical reasoning, or want to evaluate custom models, this project is a valuable starting point.
GitHub: https://github.com/PaperPlaneDeemo/IMO2025-LLM
YouTube: