In the current discourse around artificial intelligence, sensational claims often crowd out careful consideration.

Demis Hassabis, CEO of Google DeepMind, succinctly expressed this sentiment with three words on X: “This is embarrassing.”
Hassabis’s comment was a response to an enthusiastic post by Sébastien Bubeck, a research scientist at OpenAI. Bubeck had announced that GPT-5, OpenAI’s latest large language model, had reportedly solved 10 previously unsolved mathematical problems, declaring that “Science acceleration via AI has officially begun.”
This incident from mid-October serves as a prime illustration of the issues currently plaguing AI discourse.
Bubeck’s excitement stemmed from GPT-5’s apparent success in tackling several Erdős problems.
Paul Erdős, a prolific 20th-century mathematician, left behind hundreds of open problems. To track their status, Thomas Bloom, a mathematician at the University of Manchester, UK, established erdosproblems.com, which lists more than 1,100 problems, around 430 of which are marked as solved.
When Bubeck celebrated GPT-5’s supposed breakthrough, Bloom quickly corrected the claim on X, clarifying that a problem listed without a solution on his website meant only that he was unaware of one, not that the problem was unsolved. Millions of mathematics papers exist, and no single person has read them all, though GPT-5 likely had access to a vast number.
In fact, GPT-5 had not generated new solutions at all; it had located 10 existing solutions in the literature that Bloom had simply not come across.
This incident offers two lessons. First, breakthroughs should not be announced on social media before they have been verified; a more cautious approach is needed. Second, GPT-5’s ability to unearth obscure references to prior work is remarkable in its own right, even if it is not original discovery. That genuinely useful capability was overshadowed by the exaggerated initial claim.
François Charton, a research scientist at the AI startup Axiom Math, noted that mathematicians are keen to use LLMs for sifting through extensive existing research. However, literature search lacks the allure of genuine discovery, especially for enthusiastic AI proponents on social media. Bubeck’s misstep is not an isolated case.
In August, mathematicians demonstrated that no LLM at the time could solve Yu Tsumura’s 554th Problem. Two months later, social media buzzed with reports that GPT-5 had succeeded. One observer compared it to the “Lee Sedol moment,” referencing the Go master’s loss to DeepMind’s AlphaGo in 2016.
Charton, however, highlighted that solving Yu Tsumura’s 554th Problem is not considered a major achievement by mathematicians. He described it as a question suitable for an undergraduate, noting a tendency to exaggerate such accomplishments.
More balanced evaluations of LLM capabilities are also emerging. Just as the online debate about GPT-5 was unfolding, two new studies examined LLM use in medicine and law, fields where AI developers have often claimed their technology excels.
Researchers found that while LLMs could assist with certain medical diagnoses, they fell short when recommending treatments. The legal study found that LLMs frequently gave inconsistent and inaccurate advice. The authors concluded that the evidence thus far “spectacularly fails to meet the burden of proof.”
Such nuanced findings, however, do not typically gain traction on platforms like X. Charton explained that the intense excitement on social media stems from a desire to stay current, as X often serves as the primary channel for AI news, new results, and public debates among prominent figures like Sam Altman, Yann LeCun, and Gary Marcus. The pace is challenging to keep up with, and the spectacle is hard to ignore.
Bubeck’s post became embarrassing only because his error was quickly identified. Not all inaccuracies are. Without a shift in approach, researchers, investors, and general boosters may continue to reinforce each other’s exaggerated claims. Charton observed that while some are scientists, many are not, but all are enthusiasts, and “huge claims work very well on these networks.”
Recent Developments in AI Math Models
Shortly after these exchanges, Axiom Math’s own model, AxiomProver, reportedly solved two open Erdős problems (#124 and #481), a striking result for a small startup founded only months earlier and a sign of how quickly the field is moving.
Five days later, Axiom announced that AxiomProver had solved nine of the 12 problems in the annual Putnam competition, a collegiate mathematics contest often considered more difficult than the International Math Olympiad (at which LLMs from Google DeepMind and OpenAI had excelled months earlier).
The Putnam results garnered praise on X from notable figures such as Jeff Dean, chief scientist at Google DeepMind, and Thomas Wolf, cofounder of Hugging Face. However, familiar debates resurfaced in the replies. Some researchers noted that while the International Math Olympiad emphasizes creative problem-solving, the Putnam competition primarily tests mathematical knowledge, making it notoriously difficult for undergraduates but potentially more accessible for LLMs trained on vast internet data.
Evaluating Axiom’s accomplishments will take more than social media pronouncements. Impressive competition results are only a starting point; a real understanding of LLMs’ mathematical abilities requires a closer look at how these models arrive at their answers on hard problems.