Lessons from a DeepSeek R1 mini-eval on AIMO2 Ref-10: Are Thinking LLMs a New Paradigm for Evaluation and Inference?

Cite
@misc{frieder2025thinkingLLMsparadigm,
  title={Thinking LLMs: A New Paradigm for Evaluation and Inference},
  author={Simon Frieder},
  year={2025},
  month={January},
  url={https://www.friederrr.org/blog/thinking-llms-new-paradigm-eval-inference},
  howpublished={\url{https://www.friederrr.org/blog/thinking-llms-new-paradigm-eval-inference}}
}

DeepSeek R1 has been released, and everyone lost it (for Nvidia, "it" was $100bn in stock value). I have a rather different view of DeepSeek's impact. But I didn't want more opinions or anecdotal evidence about R1's capabilities. I wanted hard data, so I made a mini-eval on the 10 reference problems for the second AI Math Olympiad (AIMO2), which I shall call here the "AIMO2 Ref-10" dataset. Just 10 problems, you might say? Well, read on - I think R1, which, unlike o1, has transparent thinking tokens, ushered in a new era of LLMs by generating outputs of unprecedented lengths - and this will require a bit of adjusting.
I did a full run of the AIMO2 Ref-10 dataset on three models: o1, the Qwen-32B distilled version of DeepSeek R1 ("deepseek_r1_qwen_32b_distill"), and the full DeepSeek R1. The problems from AIMO2 Ref-10 are all at the national Olympiad level - i.e., very hard math problems - though they only use high-school-level mathematical concepts.
These problems are better to use than other problems with hard proofs, as the full proofs aren't yet public (only the final numeric answers are), except for one problem, the "airline" problem. Thus, there was likely no contamination of the training sets of R1 and o1 with the proofs of the AIMO problems. Solutions to the "airline" problem have been posted on the AIMO Kaggle discussion by contestants; it caused a lot of confusion among them and seems very hard for humans too - arguably it is beyond national level and closer to IMO level. Interestingly, even though it was the only problem with a public proof, and hence the only one with potentially contaminated training data, it was the problem on which all models failed.
R1 solves 8/10 of these unseen math olympiad problems (again, the solutions don't exist publicly, so no contamination issue arises); this rises to 9/10 if R1 is prompted a second time. o1 solves 7/10 and deepseek_r1_qwen_32b_distill solves 5/10. Compared to o1, R1 could have an advantage by design: it is biased towards competitive math problems like the ones from AIMO2 (its initial fine-tuning stage focuses on this type of problem, e.g. using GSM8K, whose problems are much easier but similar in "spirit"), and such problems do not require significant amounts of mathematical abstraction.
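For reference, here is a minimal sketch of how such a mini-eval could be scripted against OpenAI-compatible chat APIs. Note that this is not what I actually did (I ran R1 through the GUI and the distill through a HuggingFace endpoint, as described next); the base URL, model names, API keys, and the CSV column name are assumptions for illustration.

```python
# A minimal sketch of how a run like this could be scripted, assuming an
# OpenAI-compatible chat API for each model. The base URL, model names, API
# keys, and the CSV column name are placeholders/assumptions, not my setup.
import csv
from openai import OpenAI

MODELS = {
    "o1": (OpenAI(), "o1"),  # standard OpenAI endpoint
    "deepseek_r1": (OpenAI(base_url="https://api.deepseek.com",
                           api_key="<DEEPSEEK_KEY>"), "deepseek-reasoner"),
}

def run_eval(problems_csv: str) -> dict[str, list[str]]:
    """Send each AIMO2 Ref-10 problem to each model and collect the raw outputs."""
    with open(problems_csv, newline="") as f:
        # the column name "problem" is an assumption about the reference csv
        problems = [row["problem"] for row in csv.DictReader(f)]
    outputs: dict[str, list[str]] = {name: [] for name in MODELS}
    for name, (client, model_id) in MODELS.items():
        for problem in problems:
            resp = client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": problem}],
            )
            outputs[name].append(resp.choices[0].message.content)
    return outputs
```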
Getting this mini-eval to run wasn't entirely straightforward: I wasn't able to get R1 to run with Ollama on an 8x H100 node, as it kept timing out; I chose not to spend time fixing this and to keep the mini-eval moving, so I used the GUI instead, which also timed out a few times, but less often. Running o1, on the other hand, went flawlessly. Running deepseek_r1_qwen_32b_distill continually timed out on the HuggingFace Playground; I then used a dedicated Inference Endpoint from HuggingFace with 2x A100s, where it still timed out three times.
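If you prefer to script around such timeouts rather than retry by hand, a minimal sketch using the huggingface_hub client is below; the endpoint URL, token, timeout, and backoff values are placeholders/assumptions, not the exact configuration I used.

```python
# A minimal sketch of querying a dedicated HuggingFace Inference Endpoint with
# simple retries around timeouts. The endpoint URL, token, timeout, and backoff
# are placeholders/assumptions rather than the exact configuration I used.
import time
from huggingface_hub import InferenceClient

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder

client = InferenceClient(model=ENDPOINT_URL, token="<HF_TOKEN>", timeout=600)

def generate_with_retries(prompt: str, max_attempts: int = 3) -> str:
    """Call the endpoint, retrying with a crude backoff when a call times out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.text_generation(prompt, max_new_tokens=32768)
        except Exception:  # timeouts surface as HTTP/connection errors
            if attempt == max_attempts:
                raise
            time.sleep(30 * attempt)  # wait a bit longer after each failure
```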
More interesting than the scores and the technical bits is the length of the thinking segment for R1. For the "airline" problem, which it cannot solve, it generates a whopping 39 pages (!!). This is too much for a human to inspect at scale (and I can't fully trust o1 to inspect it for me, although it did quite well). We've thus entered a new paradigm with these thinking models: previously, we didn't really know what happened during training - now we also don't know what happens during inference :D.
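For the curious, here is a quick sketch of how one can pull out and size the thinking segment, assuming the raw output wraps its reasoning in <think>...</think> tags (as R1 and its distills do); the characters-per-page figure is a rough assumption of mine.

```python
# A quick sketch of how to pull out and size R1's thinking segment, assuming
# the raw output wraps its reasoning in <think>...</think> tags (as R1 and its
# distills do). The ~3000-characters-per-page figure is a rough assumption.
import re

CHARS_PER_PAGE = 3000  # rough assumption for estimating "pages" of output

def thinking_stats(raw_output: str) -> dict[str, float]:
    """Return size statistics for the thinking segment of a raw model output."""
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    thinking = match.group(1) if match else ""
    return {
        "chars": len(thinking),
        "words": len(thinking.split()),
        "pages_estimate": len(thinking) / CHARS_PER_PAGE,
    }
```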
In total, R1 alone generated over 150 pages of content for just 10 problems. Feel free to download it here; it is just a quick copy-paste of the outputs into an .odt file (with the thinking parts in grey). No prompts are copied, and the order is the same as in the AIMO2 reference problem csv file. This much math is simply too much to inspect.
I tried asking GPT-4o and o1 to assess the output for the "airline" problem, and that worked to a degree, but I worry we might reach a point where it will simply be hard to inspect (and correct) these models on reasoning domains with many very long chains of reasoning - and my preprint on the new datasets we'll need to train math copilots points in this direction. We will need automated tools to inspect long reasoning chains, but we don't have them yet. This raises significant hurdles for progress in this domain: in the past we relied on humans annotating a lot of data, but for complex domains such as math it could be hard to find a sufficiently large cohort of experts, and if you do find them, they may be very expensive to pay. Companies already offer hundreds of pounds when recruiting mathematicians to generate a single problem, which shows how steep the costs could become. Really good models like o1 (or the rumoured imminent o3) might help us to some extent, but it's unclear for how long even these models will be enough. We might be entering a new paradigm with thinking LLMs, in which evaluating a model beyond merely anecdotal evidence - with rigorous scientific criteria - becomes very hard to do.
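To make the idea of "automated tools" concrete, here is a rough sketch of what a first attempt could look like: chunk a long reasoning trace and ask a strong model to flag suspect steps in each chunk. The judge model, chunk size, and prompt wording are assumptions of mine; this illustrates the idea, not the ad-hoc prompting I actually did.

```python
# A rough sketch of the kind of automated inspection tool argued for above:
# chunk a very long reasoning trace and ask a strong model to flag suspect
# steps in each chunk. The judge model, chunk size, and prompt wording are
# assumptions; this illustrates the idea, not the ad-hoc prompting I did.
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "o1"   # assumed judge model
CHUNK_CHARS = 12000  # assumed chunk size that fits comfortably in context

def inspect_reasoning(problem: str, trace: str) -> list[str]:
    """Ask the judge model to review each chunk of a long reasoning trace."""
    chunks = [trace[i:i + CHUNK_CHARS] for i in range(0, len(trace), CHUNK_CHARS)]
    verdicts = []
    for idx, chunk in enumerate(chunks):
        resp = client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[{
                "role": "user",
                "content": (
                    f"Problem:\n{problem}\n\n"
                    f"Reasoning excerpt {idx + 1}/{len(chunks)}:\n{chunk}\n\n"
                    "List any mathematical errors or unjustified steps in this "
                    "excerpt, or reply 'no issues found'."
                ),
            }],
        )
        verdicts.append(resp.choices[0].message.content)
    return verdicts
```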
Also note: R1 is, I believe, trained to give an answer in any case. With some techniques to teach it, perhaps, to delay committing to an answer, one could make it better. We saw a jump on the AIMO2 leaderboard too, and the contestants will probably throw every known cheap technique at it. Those Kaggle notebooks will likely be a good source of future improvements on top of R1. Have a look at our highly active Discussion forum :)