Lessons from a DeepSeek R1 mini-eval on AIMO2 Ref-10: Are Thinking LLMs a New Paradigm for Evaluation and Inference?

Cite
@misc{frieder2025thinkingLLMsparadigm,
  title={Thinking LLMs: A New Paradigm for Evaluation and Inference},
  author={Simon Frieder},
  year={2025},
  month={January},
  url={https://www.friederrr.org/blog/thinking-llms-new-paradigm-eval-inference},
  howpublished={\url{https://www.friederrr.org/blog/thinking-llms-new-paradigm-eval-inference}}
}

DeepSeek R1 has been released, and everyone lost it (for Nvidia, "it" was $100bn in stock value). I have a rather different view of DeepSeek's impact. But I didn't want more opinions or anecdotal evidence about R1's capabilities. I wanted hard data, so I made a mini-eval on the 10 reference problems for the second AI Math Olympiad (AIMO2), which I shall call here the "AIMO2 Ref-10" dataset. Just 10 problems, you might say? Well, read on - I think R1, which, unlike o1, has transparent thinking tokens, ushered in a new era of LLMs by generating outputs of unprecedented lengths - and this will require a bit of adjusting.
I did a full run of the AIMO2 Ref-10 dataset on three models: o1, the Qwen-32B distilled version of DeepSeek R1 ("deepseek_r1_qwen_32b_distill"), and the full DeepSeek R1. The problems from AIMO2 Ref-10 are all at the national Olympiad level - i.e., very hard math problems - though they only use high-school-level mathematical concepts.
These problems are better to use than other problems with hard proofs, as the full proofs aren't yet public (only the final numeric answers are), except for one problem, the "airline" problem. Thus, there was likely no contamination of the training sets of R1 and o1 with the proofs of the AIMO problems. Solutions to the "airline" problem have been posted on the AIMO Kaggle discussion by contestants; it caused a lot of confusion among them and seems very hard for humans too - arguably it is beyond national level and closer to IMO level. Interestingly, even though it was the only problem with a public proof, and hence the only one with potentially contaminated training data, it was the problem on which all models failed.
R1 solves 8/10 of these unseen math olympiad problems (again, the solutions don't exist publicly, so no contamination issue arises); this rises to 9/10 if R1 is prompted a second time. o1 solves 7/10 and deepseek_r1_qwen_32b_distill solves 5/10. Compared to o1, R1 could have an advantage by design: it is biased towards competitive math problems like the ones from AIMO2 (its initial fine-tuning stage focuses on this type of problem, e.g. using GSM8K, whose problems are much easier but similar in "spirit"), and such problems do not require significant amounts of mathematical abstraction.
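For reference, here is a minimal sketch of how such a mini-eval could be scripted against OpenAI-compatible chat APIs. Note that this is not what I actually did (I ran R1 through the GUI and the distill through a HuggingFace endpoint, as described next); the base URL, model names, API keys, and the CSV column name are assumptions for illustration.

```python
# A minimal sketch of how a run like this could be scripted, assuming an
# OpenAI-compatible chat API for each model. The base URL, model names, API
# keys, and the CSV column name are placeholders/assumptions, not my setup.
import csv
from openai import OpenAI

MODELS = {
    "o1": (OpenAI(), "o1"),  # standard OpenAI endpoint
    "deepseek_r1": (OpenAI(base_url="https://api.deepseek.com",
                           api_key="<DEEPSEEK_KEY>"), "deepseek-reasoner"),
}

def run_eval(problems_csv: str) -> dict[str, list[str]]:
    """Send each AIMO2 Ref-10 problem to each model and collect the raw outputs."""
    with open(problems_csv, newline="") as f:
        # the column name "problem" is an assumption about the reference csv
        problems = [row["problem"] for row in csv.DictReader(f)]
    outputs: dict[str, list[str]] = {name: [] for name in MODELS}
    for name, (client, model_id) in MODELS.items():
        for problem in problems:
            resp = client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": problem}],
            )
            outputs[name].append(resp.choices[0].message.content)
    return outputs
```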
Getting this mini-eval to run wasn't entirely straightforward: I wasn't able to get R1 to run with Ollama on an 8x H100 node, as it kept timing out; I chose not to spend time fixing this and to keep the mini-eval moving, so I used the GUI instead, which also timed out a few times, but less often. Running o1, on the other hand, went flawlessly. Running deepseek_r1_qwen_32b_distill continually timed out on the HuggingFace Playground; I then used a dedicated Inference Endpoint from HuggingFace with 2x A100s, where it still timed out three times.
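If you prefer to script around such timeouts rather than retry by hand, a minimal sketch using the huggingface_hub client is below; the endpoint URL, token, timeout, and backoff values are placeholders/assumptions, not the exact configuration I used.

```python
# A minimal sketch of querying a dedicated HuggingFace Inference Endpoint with
# simple retries around timeouts. The endpoint URL, token, timeout, and backoff
# are placeholders/assumptions rather than the exact configuration I used.
import time
from huggingface_hub import InferenceClient

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder

client = InferenceClient(model=ENDPOINT_URL, token="<HF_TOKEN>", timeout=600)

def generate_with_retries(prompt: str, max_attempts: int = 3) -> str:
    """Call the endpoint, retrying with a crude backoff when a call times out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.text_generation(prompt, max_new_tokens=32768)
        except Exception:  # timeouts surface as HTTP/connection errors
            if attempt == max_attempts:
                raise
            time.sleep(30 * attempt)  # wait a bit longer after each failure
```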
More interesting than the scores and the technical bits is the length of the thinking segment for R1. For the "airline" problem, which it cannot solve, it generates a whopping 39 pages (!!). This is too much for a human to inspect at scale (and I can't fully trust o1 to inspect it for me, although it did quite well). We've thus entered a new paradigm with these thinking models: previously, we didn't really know what happened during training - now we also don't know what happens during inference :D.
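For the curious, here is a quick sketch of how one can pull out and size the thinking segment, assuming the raw output wraps its reasoning in <think>...</think> tags (as R1 and its distills do); the characters-per-page figure is a rough assumption of mine.

```python
# A quick sketch of how to pull out and size R1's thinking segment, assuming
# the raw output wraps its reasoning in <think>...</think> tags (as R1 and its
# distills do). The ~3000-characters-per-page figure is a rough assumption.
import re

CHARS_PER_PAGE = 3000  # rough assumption for estimating "pages" of output

def thinking_stats(raw_output: str) -> dict[str, float]:
    """Return size statistics for the thinking segment of a raw model output."""
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    thinking = match.group(1) if match else ""
    return {
        "chars": len(thinking),
        "words": len(thinking.split()),
        "pages_estimate": len(thinking) / CHARS_PER_PAGE,
    }
```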
In total, R1 alone generated over 150 pages of content for just 10 problems. Feel free to download it here; it is just a quick copy-paste of the outputs into an .odt file (with the thinking parts in grey). No prompts are copied, and the order is the same as in the AIMO2 reference problem csv file. This much math is simply too much to inspect.
I tried asking GPT-4o and o1 to assess the output for the "airline" problem, and that worked to a degree, but I worry we might reach a point where it will simply be hard to inspect (and correct) these models on reasoning domains with many very long chains of reasoning - and my preprint on the new datasets we'll need to train math copilots points in this direction. We will need automated tools to inspect long reasoning chains, but we don't have them yet. This raises significant hurdles for progress in this domain: in the past we relied on humans annotating a lot of data, but for complex domains such as math it could be hard to find a sufficiently large cohort of experts, and if you do find them, they may be very expensive to pay. Companies already offer hundreds of pounds when recruiting mathematicians to generate a single problem, which shows how steep the costs could become. Really good models like o1 (or the rumoured imminent o3) might help us to some extent, but it's unclear for how long even these models will be enough. We might be entering a new paradigm with thinking LLMs, in which evaluating a model beyond merely anecdotal evidence - with rigorous scientific criteria - becomes very hard to do.
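To make the idea of "automated tools" concrete, here is a rough sketch of what a first attempt could look like: chunk a long reasoning trace and ask a strong model to flag suspect steps in each chunk. The judge model, chunk size, and prompt wording are assumptions of mine; this illustrates the idea, not the ad-hoc prompting I actually did.

```python
# A rough sketch of the kind of automated inspection tool argued for above:
# chunk a very long reasoning trace and ask a strong model to flag suspect
# steps in each chunk. The judge model, chunk size, and prompt wording are
# assumptions; this illustrates the idea, not the ad-hoc prompting I did.
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "o1"   # assumed judge model
CHUNK_CHARS = 12000  # assumed chunk size that fits comfortably in context

def inspect_reasoning(problem: str, trace: str) -> list[str]:
    """Ask the judge model to review each chunk of a long reasoning trace."""
    chunks = [trace[i:i + CHUNK_CHARS] for i in range(0, len(trace), CHUNK_CHARS)]
    verdicts = []
    for idx, chunk in enumerate(chunks):
        resp = client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[{
                "role": "user",
                "content": (
                    f"Problem:\n{problem}\n\n"
                    f"Reasoning excerpt {idx + 1}/{len(chunks)}:\n{chunk}\n\n"
                    "List any mathematical errors or unjustified steps in this "
                    "excerpt, or reply 'no issues found'."
                ),
            }],
        )
        verdicts.append(resp.choices[0].message.content)
    return verdicts
```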
Also note: R1 is, I believe, trained to give an answer in any case. With some techniques to teach it, perhaps, to delay committing to an answer, one could make it better. We saw a jump on the AIMO2 leaderboard too, and the contestants will probably throw every known cheap technique at it. Those Kaggle notebooks will likely be a good source of future improvements on top of R1. Have a look at our highly active Discussion forum :)