
Researchers at NVIDIA have developed a new technique that flips the script on how large language models (LLMs) learn to reason.
The method, called Reinforcement Learning Pre-training (RLP), integrates reinforcement learning (RL) into the initial training phase instead of saving it for the end.
This approach encourages the model to “think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in pretraining,” the researchers write in their paper.
By learning to reason on plain text without the need for external verifiers, models trained with RLP show significant improvements at learning complex reasoning tasks downstream, hinting at a future of AI that is more capable in real-world applications.
The standard LLM training cycle
Typically, large language models are first pretrained on vast amounts of text with a “next-token prediction” objective, in which they are given a string of text and asked to predict the next word (or token). At this stage, they learn grammar, facts, and basic associations.
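As a rough illustration, the following sketch shows that standard objective in PyTorch. It assumes a generic autoregressive model that returns per-position vocabulary logits; the function name and shapes here are illustrative assumptions, not code from the paper.

```python
# A minimal sketch of standard next-token prediction pretraining (illustrative only).
# Assumes `model(token_ids)` returns logits of shape (batch, seq_len, vocab_size).
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Cross-entropy between the model's predictions and the actual next tokens."""
    logits = model(token_ids[:, :-1])         # predict from each prefix
    targets = token_ids[:, 1:]                # the true next token at each position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * seq, vocab)
        targets.reshape(-1),                  # (batch * seq,)
    )
```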
In the later post-training phase, models usually learn complex reasoning capabilities such as chain-of-thought (CoT), where a model lays out its reasoning step by step. This stage often involves supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), which require specialized, curated datasets.
The paper’s authors argue that this sequential order does not match human comprehension, which is “not a linear token-by-token process, but rather a parallel integration of input with prior knowledge.” Current pre-training methods lack this mechanism, which hinders a model’s ability to develop deep reasoning from the start.
How reinforcement learning pre-training works
RLP reframes this process by treating chain-of-thought generation as an action the model takes before predicting the next token. At each step, the model first generates an internal “thought,” or reasoning chain. It then predicts the next word in the text, using both its new thought and the original context.
The model receives a reward based on how much its thought improved the accuracy of its prediction compared to a baseline that generated no thought at all (pure next-token prediction). The reward signal is computed automatically from this change in likelihood, eliminating the need for external verifiers or human-labeled data.
The reward is positive only when the generated thought helps the model predict the next token better. By rewarding thoughts based on their predictive benefit, RLP effectively teaches the model how to think usefully on the same massive, unstructured datasets used for standard pre-training.
This continuous feedback loop allows the model to learn when a simple prediction is sufficient and when it needs to engage in deeper reasoning. As the researchers put it, “RLP is designed to create thinking in base models by rewarding only those thoughts that help next-token prediction.”
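To make the mechanism concrete, here is a minimal sketch of how such an information-gain reward could be computed. It assumes a generic autoregressive model that returns per-position vocabulary logits; the helper names and the simple no-thought baseline are illustrative assumptions, and the paper’s actual implementation may differ (for instance, in how the baseline is formed).

```python
# A minimal sketch of an RLP-style reward (illustrative, not the authors' implementation).
# Assumes `model(token_ids)` returns logits of shape (batch, seq_len, vocab_size).
import torch
import torch.nn.functional as F

def next_token_logprob(model, prefix_ids, target_id):
    """Log-probability the model assigns to `target_id` after seeing `prefix_ids`."""
    logits = model(prefix_ids)                           # (1, prefix_len, vocab)
    log_probs = F.log_softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
    return log_probs[0, target_id]

def rlp_reward(model, context_ids, thought_ids, target_id):
    # Prediction conditioned on the model's own sampled thought plus the original context.
    with_thought = next_token_logprob(
        model, torch.cat([context_ids, thought_ids], dim=1), target_id
    )
    # Baseline: the same prediction made from the raw context alone, with no thought.
    without_thought = next_token_logprob(model, context_ids, target_id)
    # The reward is the improvement in predictive likelihood; it is positive only
    # when the generated thought actually helped predict the true next token.
    return with_thought - without_thought
```

In a full training loop, this scalar reward would then be used to reinforce the sampled thought, for example through a policy-gradient-style update, so that helpful thinking becomes more likely over time.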
Crucially, this approach does not make the later fine-tuning stages obsolete. According to Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA and a co-author of the paper, RLP is designed to complement those important steps. “RLP isn’t meant to replace the later post-training stages such as supervised fine-tuning or reinforcement learning from human feedback,” Catanzaro told VentureBeat. “Those stages remain crucial for refining the model’s behavior… It’s really designed to amplify the effectiveness of those later stages by giving the model a head start.”
RLP in action
In experiments with Qwen3-1.7B and Nemotron-Nano-12B, the NVIDIA team tested RLP across a suite of math and science reasoning benchmarks. The results show that models enhanced with RLP consistently outperformed their conventionally trained counterparts, with especially strong gains on reasoning-heavy tasks.
For an enterprise, this could translate into more reliable outputs and better reasoning in multi-step workflows such as financial analysis or legal document summarization.
“RLP encourages the model to think before it predicts during pretraining, helping the model internalize a more coherent reasoning style,” Catanzaro said. “This could help reduce subtle logical errors, especially in longer workflows.”
While emphasizing that RLP-trained models will still need the usual guardrails, such as verification layers, human oversight, and consistency checks, Catanzaro said, “RLP gives you a strong baseline.”
Importantly, the benefits of RLP compound rather than disappearing during subsequent fine-tuning stages (catastrophic forgetting is a common problem in LLM training, where later training stages cause a model to lose skills and knowledge it learned earlier). RLP-trained models achieved an overall score 7-8% higher than baselines after identical post-training. The researchers conclude that RLP “establishes the foundations of strong reasoning that are not washed away by alignment, but rather compound with later training.”
The technique’s efficiency is a key finding. On the Qwen3-1.7B model, RLP improved performance by 17% over standard continuous pre-training and also beat a similar technique called Reinforcement Pretraining via Prefix-Matching Rewards (RPT). The advantage held even when the baseline model was trained on 35 times more data to match the computational cost, confirming that the gains come from the method itself, not merely from processing more data.
RLP also demonstrated impressive scalability and versatility, successfully extracting a reasoning signal from general-purpose web data, not just curated datasets. When applied to Nemotron-Nano-12B, a hybrid Mamba-Transformer model, RLP achieved a 35% relative improvement over a heavily trained baseline while using only a small fraction of the data.
While these results point to a more efficient path to building powerful models, Catanzaro frames the innovation as a fundamental change in how models learn, rather than an immediate fix for high training costs.
“This research is exciting because it offers a shift in how models absorb information during pretraining, leading to a better learning process,” he explained. “It won’t replace large-scale pretraining, but it offers another creative method for building the best possible models.”
A new foundation for AI training
Ultimately, RLP points to a future where pre-training is no longer a monolithic next-token prediction task. Instead, the next generation of models could be built on a hybrid of objectives, producing AI that learns to think robustly from day one. Catanzaro offers a powerful analogy to frame this shift:
“Next-token prediction teaches a model what the world looks like; reinforcement-style objectives like RLP can teach it how to think about what it’s seeing,” he said. “Combining these two objectives could help models develop longer, more structured thinking… Tools like RLP can build on top of that foundation, making learning more active, curious, and even more efficient.”
There is still much to learn about the dynamics of reasoning in the pre-training phase, but what seems clear is that “introducing exploration early in training opens a new axis for scaling, not just in size, but in how models learn.”