New ‘Markovian thinking’ technique paves the way for million-token AI reasoning

by SkillAiNest

Mila researchers have proposed a new technique that makes large language models (LLMs) more efficient at complex reasoning. Called Markovian Thinking, the approach allows LLMs to engage in lengthy reasoning without the prohibitive computational costs that currently limit such tasks.

The team’s implementation, an environment called Delethink, structures the reasoning chain into fixed-size chunks, breaking the scaling problem that plagues very long LLM responses. Preliminary estimates show that for a 1.5B-parameter model, this method can cut training costs by more than two-thirds compared with the standard approach.

The quadratic curse of long-chain reasoning

For an LLM to solve a complex problem, it often needs to generate a long series of intermediate “thought” tokens, commonly called a chain of thought (CoT). In recent years, researchers have found that using reinforcement learning (RL) to train models to produce longer CoTs markedly improves their performance on reasoning tasks.

However, the standard approach has a major flaw: the model’s “state” (the original prompt plus every reasoning token generated so far) grows with each new token. For modern Transformer-based models, this means the computational cost grows quadratically as the reasoning chain gets longer, making it prohibitively expensive to train models for very complex tasks.
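
To see the scale of the problem, note that each newly generated token must attend to every token before it, so total attention work grows with the square of the chain length. The sketch below is a minimal illustration of that arithmetic, with assumed token counts rather than any figures from the paper:

```python
# Why LongCoT cost is quadratic: generating token t requires attending over
# all t previous tokens, so the total work for an n-token chain is
# sum(1..n) ~ n^2 / 2. Token counts here are illustrative only.
def longcot_attention_ops(n: int) -> int:
    return sum(range(1, n + 1))  # ~ n^2 / 2 pairwise attention operations

for n in (8_000, 24_000, 96_000):
    print(f"{n:>6} tokens -> {longcot_attention_ops(n):,} attention ops")

# Tripling the chain from 8k to 24k tokens costs ~9x more;
# a 96k-token chain costs ~144x more than an 8k one.
```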

Most current efforts to manage this cost focus on limiting how much the model thinks, for example by explicitly favoring shorter solutions or terminating the process early. While these methods offer some relief, they still operate within the LongCoT framework and thus remain fundamentally bound by its quadratic nature.

Instead of trying to rein in the computational growth, the Mila team created an RL environment that avoids the quadratic problem altogether. As co-author Amirhossein Kazemnejad explained, the goal is to enable capabilities such as multi-week reasoning and scientific discovery. "This regime (and the RL required to enable such capabilities) is not supported by the current LongCoT paradigm, because of the quadratic compute cost," he said.

Thinking in chunks with Delethink

The researchers’ solution is a paradigm they call the "Markovian Thinker," in which the model reasons while keeping the size of its reasoning context window constant. The core idea is to decouple, in the RL setup, "how long the model thinks" from "how much context it must attend to." Done right, Markovian Thinking turns the quadratic growth problem into linear compute and fixed memory requirements for LLM reasoning.

The researchers put this paradigm into practice with Delethink, an environment that forces the model to reason in a sequence of fixed-size chunks, such as 8,000 tokens at a time. Within each chunk, the model reasons as it normally would, using the classic attention mechanism. But when it reaches the chunk limit, the environment resets the context, building a new prompt that contains the original query plus a short "carryover" from the previous chunk. That carryover could be, for example, the last few tokens of the previous chunk or a summary of its most important results.

This reformulation of the problem forces the model to learn to embed a summary of its progress, a "textual Markovian state," into the carryover so it can continue reasoning in the next chunk. This raises a common concern about whether the model can recall important details from earlier steps.
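
At inference time, the loop looks roughly like the sketch below. This is a minimal illustration rather than the paper's implementation: `generate` is a hypothetical stand-in for any LLM call, and the chunk size, carryover length, and "FINAL ANSWER" stop marker are assumptions, not the researchers' exact settings.

```python
# A minimal sketch of Delethink-style chunked reasoning at inference time.
CHUNK_TOKENS = 8_000  # fixed reasoning budget per chunk
CARRY_CHARS = 2_000   # tail kept as the "textual Markovian state"
                      # (a character slice stands in for token counting)

def markovian_reason(query: str, generate, max_chunks: int = 20) -> str:
    carryover = ""
    for _ in range(max_chunks):
        # The context stays bounded: the original query plus a short carryover.
        prompt = query if not carryover else f"{query}\n\n[Progress so far]\n{carryover}"
        chunk = generate(prompt, max_tokens=CHUNK_TOKENS)
        if "FINAL ANSWER" in chunk:  # model signals it has finished
            return chunk
        # Reset the context; keep only the tail of the chunk, which the model
        # must learn to pack its progress into.
        carryover = chunk[-CARRY_CHARS:]
    return carryover  # budget exhausted; return the best-effort state
```

The same loop, wrapped around an off-the-shelf model, is essentially the kind of "Delethink tracing wrapper" the researchers describe applying to untrained models below.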

According to Kazemnejad, the model learns what to remember. "With training … the model is forced to learn to carry forward the task-critical state," he explained. He added an important clarification for practical use: the original input prompt, including any documents or contextual data, is not modified. "Our approach targets the reasoning stage and does not modify the prompt," he said.

Delethink in action

To test their approach, the researchers trained R1-Distill-1.5B with Delethink on a dataset of competition-level math problems, then evaluated it on several benchmarks. The model was trained to reason for up to 24,000 tokens, but with a fixed chunk size of 8,000 tokens.

The researchers compared this with models trained using the standard LongCoT-RL method. Their results show that the Delethink-trained model could reason up to 24,000 tokens and matched or outperformed a LongCoT model trained with the same 24,000-token budget on the math benchmarks. On other tasks, such as coding and PhD-level questions, Delethink also matched or slightly beat its LongCoT counterpart. "Overall, these results indicate that Delethink uses its thinking tokens as effectively as LongCoT-RL with less compute," the researchers write.

The benefits become even more pronounced when scaling beyond the training budget. While models trained with LongCoT quickly plateaued at their training limit, the Delethink-trained model kept improving. For example, some math problems were solved only after the model had reasoned well past its 24,000-token training budget, for up to 140,000 tokens. This linear-compute advantage has significant implications for enterprise applications. The researchers estimate that training a model to an average thinking length of 96,000 tokens would require 27 H100-GPU-months with LongCoT, compared with only 7 using Delethink.
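
For intuition about where those savings come from, here is a rough back-of-envelope comparison. The proportionality constants are assumptions, not the paper's cost model:

```python
# Attention work for a LongCoT trace of average length n grows ~ n^2, while a
# Delethink trace with fixed chunk size C pays ~ n * C (each token attends to
# at most one chunk of context).
n = 96_000  # average thinking length (tokens)
C = 8_000   # Delethink chunk size (tokens)

ratio = (n * n) / (n * C)  # simplifies to n / C
print(f"Attention-cost ratio, LongCoT vs. Delethink: ~{ratio:.0f}x")
# Prints ~12x. The reported 27 vs. 7 H100-GPU-months (~4x) gap is smaller
# because total training cost also includes compute that scales the same
# way under both methods.
```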

This efficiency carries over directly to inference, the primary operational cost for most businesses. "Models trained with Markovian Thinking use the same thinking style (Delethink tracing) at test time, which provides the same benefits of linear compute and constant memory after training," Kazemnejad said. He gave a practical example: an AI agent could "debug a large codebase and think for a long time … which substantially reduces cost compared to the conventional LongCoT approach."

Interestingly, the researchers found that off-the-shelf reasoning models, even without special training, already show some ability to think in a Markovian fashion. This finding has immediate practical implications for developers. "In practice, this means that, even without Delethink-RL, these models can already run with a Delethink tracing wrapper and perform competitively with LongCoT on our benchmarked tasks," Kazemnejad said.

In their experiments, large models such as GPT-OSS 120B demonstrated strong performance across a range of complex tasks when run this way. This latent ability provides a strong starting point for RL training, which helps explain why the method is so effective. "Together, these results suggest that Delethink is compatible with, and scales alongside, the most recent models," the researchers concluded.

The success of Markovian Thinking shows that it is possible for "next-generation reasoning models to think for millions of tokens," the researchers note. This opens the door to fundamentally new AI capabilities beyond current limits.

"Markovian thinking … opens the way for these models of ‘thinking’ for much longer horizons, which we see as a necessary step towards the ultimate scientific discovery," Kazimanjad said. "Our approach removes a critical barrier and allows training for long-horizon tasks, enabling next-generation talent."
