IMO 2025 has concluded in Australia, and evaluations show that current AI models still face significant limitations in rigorous mathematical reasoning.
Conclusion: test results show that current AI models still have considerable room for improvement on complex mathematical problems, and that there is a significant gap between producing a correct answer and providing complete mathematical reasoning.
The project "IMO2025-LLM", published by GitHub user PaperPlaneDeemo, is a benchmark and script library for evaluating how large language models (LLMs) perform on the IMO (International Mathematical Olympiad) 2025 competition problems.
🧠 Project background and purpose
- Goal: evaluate the problem-solving ability, reasoning process, and problem-solving cost of current mainstream LLMs (such as Anthropic Claude Sonnet 4, ByteDance Seed 1.6, and Google Gemini 2.5 Pro) on all six IMO 2025 problems;
- Why it matters: IMO-level problems are extremely difficult and challenging, making them a fitting "acid test" of an LLM's mathematical reasoning and proof construction.
Content structure and function
- Problem links: each IMO problem links to its AoPS (Art of Problem Solving) discussion thread, so you can read the statement and build an intuitive understanding of it;
- Evaluation script: `evaluate.py` can load local models or API models for testing; its output includes whether each answer is correct, the number of tokens used, cost estimates, and visual comparison charts;
- Extensibility: any model (including locally deployed open-source models) can be added by configuring its API endpoint or model path in `config.yaml` and re-running the script, which regenerates the results and charts automatically.
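To make the extensibility point concrete, here is a minimal sketch of config-driven model selection. The field names (`type`, `endpoint`, `path`) and the entries are purely illustrative assumptions, not the project's actual `config.yaml` schema:

```python
# Hypothetical sketch of config-driven model selection, in the spirit of the
# project's config.yaml. Field names and entries are illustrative only.
MODELS = {
    "gemini-2.5-pro": {"type": "api", "endpoint": "https://example.com/v1"},
    "my-local-model": {"type": "local", "path": "./weights/my-local-model"},
}

def resolve_model(name: str) -> dict:
    """Look up a model entry and return it tagged with its name."""
    cfg = MODELS.get(name)
    if cfg is None:
        raise KeyError(f"model '{name}' not found in config")
    return {"name": name, **cfg}

print(resolve_model("my-local-model")["type"])  # local
```

The idea is that adding a new model is a pure configuration change: the evaluation loop never needs to know in advance which models exist.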
Key assessment data
| Model | Questions solved | Total tokens | Estimated cost |
|---|---|---|---|
| Claude Sonnet 4 | 2/6 (Questions 1, 3) | ~235k | $3.50 |
| Gemini 2.5 Pro | 2/6 (Questions 1, 5) | ~184k | $1.84 |
| Seed 1.6 | 2/6 (Questions 3, 5) | ~104k | $0.21 |
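A quick way to compare the rows above is cost per solved question, computed directly from the article's reported figures (no new measurements):

```python
# Cost per solved question, derived from the table above.
results = {
    "Claude Sonnet 4": {"solved": 2, "tokens": 235_000, "cost_usd": 3.50},
    "Gemini 2.5 Pro":  {"solved": 2, "tokens": 184_000, "cost_usd": 1.84},
    "Seed 1.6":        {"solved": 2, "tokens": 104_000, "cost_usd": 0.21},
}

for name, r in results.items():
    per_solve = r["cost_usd"] / r["solved"]
    print(f"{name}: ${per_solve:.2f} per solved question")
```

On these numbers Seed 1.6 is by far the cheapest per solved question, at roughly $0.11 versus Claude's $1.75.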
- Two models (Seed 1.6 and Gemini 2.5 Pro) produced complete solutions to Question 5, the only fully complete solutions in the entire evaluation;
- Seed 1.6 performs well on accuracy and reasoning quality while being remarkably cheap: roughly 6% of Claude Sonnet 4's cost ($0.21 vs. $3.50).
Conclusion and significance
- IMO problems remain a hard frontier for LLM reasoning capabilities;
- The project provides an open, reproducible, and extensible framework, encouraging the community to keep evaluating more models;
- Question 5 is regarded as a new "acid test" of logical rigour and creative reasoning;
- The project is MIT-licensed and includes the problems, model outputs, and evaluation data, making it suitable for research, teaching, product development, and similar scenarios.
How to use this project
If you want to try it yourself or analyze other models, follow these steps:
- Clone the repository:

  ```
  git clone https://github.com/PaperPlaneDeemo/IMO2025-LLM.git
  cd IMO2025-LLM
  ```

- Check the README for each problem's link, the input formats, and the model descriptions;
- Run the evaluation script, for example to test a local model on Question 5:

  ```
  python evaluate.py --model my-local-model --problem 5
  ```

- The script reports whether the solution is right or wrong, the number of tokens used, and cost estimates, and updates the visualization chart.
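To evaluate one model on all six problems, the per-problem command can be scripted. The `--model`/`--problem` flags come from the example above; everything else here (exit codes, output handling) is an assumption, so the sketch only builds the commands rather than asserting anything about the script's behaviour:

```python
# Hedged sketch: build evaluate.py commands for all six IMO 2025 problems.
# Only the --model and --problem flags are taken from the project's README
# example; run behaviour is assumed, so the commands are built but not run.
import subprocess

def build_commands(model: str, problems=range(1, 7)) -> list[list[str]]:
    return [["python", "evaluate.py", "--model", model, "--problem", str(p)]
            for p in problems]

for cmd in build_commands("my-local-model"):
    # In real use: subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```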
Summary
- IMO2025-LLM is an LLM benchmark built specifically around the IMO 2025 math problems;
- It provides evaluation scripts, data logging, cost analysis, and visual charts;
- So far only Seed 1.6 and Gemini 2.5 Pro have completely solved Question 5, and the other questions remain very difficult;
- If you follow LLMs' progress in higher-order mathematical reasoning, or want to evaluate custom models, this project is a valuable starting point.
GitHub: https://github.com/PaperPlaneDeemo/IMO2025-LLM
YouTube: