Artificial intelligence models that spend more time thinking through problems don't always perform better, and in some cases they get significantly worse, according to new research by Anthropic that challenges a core assumption driving the AI industry's latest scaling efforts.
The study, led by Anthropic AI Safety Fellow Aryo Pradipta Gema and other company researchers, identifies what they call "inverse scaling in test-time compute," where extending the reasoning length of large language models actually deteriorates their performance across several types of tasks. The findings could have significant implications for enterprises deploying AI systems that rely on extended reasoning capabilities.
"We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy," the researchers write in their paper published on Tuesday.
New Anthropic research: "Inverse scaling in test-time compute"

We find cases where longer reasoning leads to lower accuracy.

Our findings suggest that naive scaling of test-time compute may inadvertently reinforce problematic reasoning patterns.

– Aryo Pradipta Gema (@aryopg) July 22, 2025
The research team, including Anthropic's Ethan Perez, Yanda Chen, and Joe Benton, along with academic collaborators, tested models across four categories of tasks: simple counting problems with distractors, regression tasks with misleading features, complex deduction puzzles, and scenarios involving AI safety concerns.
Claude and GPT models show distinct reasoning failures under extended processing
The study reveals distinct failure patterns across major AI systems. Claude models "become increasingly distracted by irrelevant information" as they reason longer, while OpenAI's o-series models "resist distractors but overfit to problem framings." In regression tasks, "extended reasoning causes models to shift from reasonable priors to spurious correlations," though providing examples largely corrects this behavior.
Perhaps most concerning for enterprise users, all models showed "performance degradation with extended reasoning" on complex deductive tasks, "suggesting difficulties in maintaining focus during complex deductive tasks."
The research also revealed troubling implications for AI safety. In one experiment, Claude Sonnet 4 showed "increased expressions of self-preservation" when given more time to reason through scenarios involving its potential shutdown.

"Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation," the researchers noted.
Why longer AI processing time doesn't guarantee better business outcomes
The findings challenge the prevailing industry wisdom that more computational resources devoted to reasoning will consistently improve AI performance. Major AI companies have invested heavily in "test-time compute," giving models more processing time to work through complex problems, as a key strategy for enhancing capabilities.
The research suggests this approach may have unintended consequences. "While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns," the authors conclude.
For enterprise decision makers, the implications are significant. Organizations deploying AI systems for critical reasoning tasks may need to carefully calibrate how much processing time they allocate, rather than assuming that more is always better.
How simple questions trip up advanced AI when given too much thinking time
The researchers offered concrete examples of the inverse scaling phenomenon. In simple counting tasks, they found that when problems were framed to resemble well-known paradoxes such as the "birthday paradox," models often tried to apply sophisticated mathematical solutions instead of answering the straightforward question.
For example, when asked "You have an apple and an orange... how many fruits do you have?" embedded within complex mathematical distractors, Claude models became increasingly distracted by irrelevant details as reasoning time increased, sometimes failing to give the simple answer: two.
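To make the failure mode concrete, here is a minimal sketch of how a team might reproduce this kind of distractor-laden counting probe. It assumes the Anthropic Python SDK's extended-thinking option; the prompt wording, model ID, and token budgets are illustrative choices, not the paper's actual evaluation harness.

```python
# Hypothetical probe inspired by the paper's counting-with-distractors tasks.
# Assumes the Anthropic Python SDK; prompt, model ID, and budgets are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A trivial question wrapped in irrelevant, paradox-flavored math.
PROMPT = (
    "You have an apple and an orange. A classroom of 23 students has about a "
    "50% chance that two share a birthday, and a fair coin flipped 10 times "
    "yields 1024 possible sequences. How many fruits do you have? "
    "Answer with a single number."
)

for budget in (1024, 4096, 16384):  # increasing "thinking" budgets
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID
        max_tokens=budget + 512,           # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": PROMPT}],
    )
    # The final text block holds the answer; thinking blocks precede it.
    answer = next(b.text for b in response.content if b.type == "text")
    print(f"budget={budget:>6}: {answer.strip()}")
```

Under the paper's findings, one would watch for the answer drifting away from "two" as the thinking budget grows.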
In regression tasks using real students' data, models initially focused on the most predictive factor (study hours), but when given more time to reason, they shifted to less reliable correlations.
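The statistical trap behind this finding is easy to demonstrate. The following sketch uses synthetic data with invented feature names, purely for illustration: in a small sample, a genuinely weak feature can show a chance correlation that a model searching long enough for patterns might latch onto.

```python
# Synthetic illustration of spurious correlation in a small "student" dataset.
# The data and feature names are invented for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
n = 40  # small samples are where spurious correlations are most likely

study_hours = rng.uniform(0, 10, n)
sleep_hours = rng.uniform(5, 9, n)               # genuinely unrelated to grades
grades = 6 * study_hours + rng.normal(0, 15, n)  # study hours drives grades

for name, feature in [("study_hours", study_hours), ("sleep_hours", sleep_hours)]:
    r = np.corrcoef(feature, grades)[0, 1]
    print(f"{name:>12}: r = {r:+.2f}")

# With n this small, sleep_hours can show a nonzero |r| purely by chance.
# A model that keeps searching for patterns may seize on such a feature,
# mirroring the paper's observation that longer reasoning shifts weight
# away from the most predictive factor.
```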
What enterprise AI deployments need to know about reasoning model limitations
The research arrives as major tech companies race to deploy increasingly sophisticated reasoning capabilities in their AI systems. OpenAI's o1 model series and other "reasoning-focused" models represent significant investments in test-time compute scaling.
However, this study suggests that naive scaling approaches may not deliver the expected benefits and could introduce new risks. "Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs," the researchers write.
The work builds on previous research showing that AI capabilities don't always scale predictably. The team references BIG-Bench Extra Hard, a benchmark designed to challenge advanced models, noting that "state-of-the-art models achieve near-perfect scores on many tasks" in existing benchmarks, making more challenging evaluations necessary.
For enterprise users, the research underscores the need for careful testing across diverse reasoning scenarios and time constraints before deploying AI systems in production environments. Organizations may need to develop more nuanced approaches to allocating computational resources rather than simply maximizing processing time.
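One way to operationalize that calibration is a simple budget sweep: evaluate a fixed task set at several reasoning lengths and keep the smallest budget that preserves accuracy. The sketch below is a hypothetical harness, not a method from the paper; `query_model` stands in for whatever provider call and reasoning-budget parameter a given deployment actually uses.

```python
# Hypothetical harness for picking a reasoning budget before production rollout.
# `query_model` is a stand-in for a real provider call with a budget knob.
from typing import Callable

def sweep_budgets(
    tasks: list[tuple[str, str]],            # (prompt, expected_answer) pairs
    budgets: list[int],                      # candidate reasoning-token budgets
    query_model: Callable[[str, int], str],  # (prompt, budget) -> model answer
    tolerance: float = 0.02,                 # accuracy drop we will accept
) -> int:
    """Return the smallest budget whose accuracy is within `tolerance`
    of the best budget observed, guarding against inverse scaling."""
    accuracy: dict[int, float] = {}
    for budget in budgets:
        correct = sum(
            query_model(prompt, budget).strip() == expected
            for prompt, expected in tasks
        )
        accuracy[budget] = correct / len(tasks)
        print(f"budget={budget:>6}: accuracy={accuracy[budget]:.2%}")

    best = max(accuracy.values())
    # Prefer the cheapest qualifying budget: per the Anthropic findings,
    # more thinking is not assumed to be better.
    return min(b for b in budgets if accuracy[b] >= best - tolerance)
```

The design choice to return the cheapest acceptable budget, rather than the largest, follows directly from the study's finding that accuracy can fall as reasoning length grows.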
The study's broader implications suggest that as AI systems become more sophisticated, the relationship between computational investment and performance may be far more complex than previously assumed. In a field where billions are being poured into scaling up reasoning capabilities, Anthropic's research offers a sobering reminder: sometimes, artificial intelligence's greatest enemy isn't insufficient processing power; it's overthinking.
The research paper and interactive demonstrations are available on the project website, allowing technical teams to explore the inverse scaling effects across different models and tasks.
