Google’s new AI training method helps small models deal with complex reasoning

by SkillAiNest

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework, Supervised Reinforcement Learning (SRL), that significantly improves the ability of language models to learn complex, multi-step reasoning tasks.

This approach enables small models to learn difficult problems that were previously beyond the reach of other common training techniques. Experiments show that SRL not only improves performance on mathematical reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.

SRL is a versatile training framework that can elevate smaller, less expensive models to higher reasoning capabilities.

Limitations of current LLM reasoning training

Recent advances in training large language models (LLMs) for reasoning have been driven largely by reinforcement learning with verifiable rewards (RLVR), a method in which a model is rewarded based on the accuracy of its final answer. By repeatedly trying to solve problems and receiving feedback on final results, the model gradually learns effective problem-solving strategies.
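
To make that feedback signal concrete, below is a minimal sketch of an outcome-only, RLVR-style reward with a simple exact-match verifier; the function and names are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of an outcome-based (RLVR-style) reward.
# Assumes a trivial exact-match verifier; real verifiers are more involved.

def rlvr_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 only if the final answer is verifiably correct, else 0.0.

    The model gets no credit for partially correct reasoning: a single slip
    anywhere in a multi-step solution zeroes the reward.
    """
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


# A mostly correct solution with one arithmetic error still earns zero,
# which is what makes the training signal sparse on hard problems.
print(rlvr_reward("42", "42"))  # 1.0
print(rlvr_reward("41", "42"))  # 0.0
```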

However, the success of this outcome-based approach depends on the model’s ability to discover a correct solution within a limited number of attempts, or "rollouts." Because each rollout is computationally expensive, models cannot try indefinitely. The method hits a wall when problems become so difficult that the model rarely, if ever, finds the correct answer within its budget.

This creates a critical barrier to learning. In many multi-step reasoning problems, a model can solve several steps correctly but be derailed by a single error, leading to an incorrect final answer. With RLVR, the entire attempt is penalized, and the model learns nothing from its partially correct work. In short, RLVR provides only sparse rewards and no granular feedback.

An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the complete reasoning process laid out by experts. Although SFT can induce reasoning abilities, it often leads to overfitting: the model learns to rigidly imitate the steps in the training data rather than generalizing to problems beyond its examples. The problem is made worse by the fact that high-quality, human-generated training data is scarce and expensive.

As the paper notes, these limitations leave a key gap in training small open-source models to learn hard problems effectively.

How supervised reinforcement learning works

SRL introduces a framework that reformulates problem solving as a sequential decision-making process, striking a balance between pure outcome-based RL and pure imitation learning. Rather than optimizing only for the final answer or forcing the model to imitate an expert’s entire thought process, SRL teaches the model to reproduce the sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take expert-like actions while developing its own internal reasoning style.

In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it might be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train smaller models.
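
As a rough illustration of this decomposition, the sketch below expands one expert solution into per-step training examples; the data structure and field names are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: split an expert trajectory into step-level examples,
# one per intermediate action, as the SRL description above suggests.

from dataclasses import dataclass


@dataclass
class StepExample:
    context: str  # problem statement plus the expert actions taken so far
    action: str   # the next expert action the student model should reproduce


def expand_trajectory(problem: str, expert_actions: list[str]) -> list[StepExample]:
    """Turn one expert solution into one training example per intermediate step."""
    examples = []
    for i, action in enumerate(expert_actions):
        prefix = "\n".join(expert_actions[:i])
        examples.append(StepExample(context=f"{problem}\n{prefix}".strip(), action=action))
    return examples


# A math trajectory: each algebraic manipulation becomes its own example.
actions = [
    "Expand (x+1)^2 to x^2 + 2x + 1",
    "Subtract x^2 from both sides",
    "Solve 2x + 1 = 5",
]
for ex in expand_trajectory("Solve (x+1)^2 - x^2 = 5", actions):
    print(ex.action)
```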

According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: it captures the structural flexibility of real-world problem solving, where there are multiple legitimate strategies but also a clear sense of what 'good reasoning' looks like at each step," Hsu told VentureBeat. "This makes SRL suitable for domains such as data science automation or perhaps supply chain optimization – tasks that reward sound intermediate reasoning rather than just final answers."

During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action. This step-wise reward scheme delivers dense, fine-grained feedback, allowing the model to learn and improve even when its overall solution is not perfect. This addresses the sparse-reward problem that RLVR faces.
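
The sketch below shows one hedged way such a step-wise reward could look, using Python's difflib string similarity as a stand-in for whatever similarity measure the authors actually use; the tag handling and names are illustrative.

```python
# Hedged sketch of a step-wise, SRL-style reward: strip the inner monologue,
# then score the proposed action against the expert action in [0, 1].
# difflib's SequenceMatcher is a stand-in for the paper's similarity metric.

import re
from difflib import SequenceMatcher


def srl_step_reward(model_output: str, expert_action: str) -> float:
    """Remove the <think>...</think> monologue and reward action similarity."""
    action = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()
    return SequenceMatcher(None, action, expert_action).ratio()


# Even an imperfect action earns a partial, dense reward rather than zero.
output = "<think>I should isolate x first.</think> Subtract 3 from both sides"
print(round(srl_step_reward(output, "Subtract 3 from both sides to isolate x"), 2))
```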

SRL in action

The researchers’ experiments showed that SRL significantly outperformed strong baselines on both mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns, such as interleaved planning and self-verification, which improve solution quality without making the output excessively long.

For enterprise leaders, efficiency gains are only valuable if they don’t come with runaway costs. Hsu noted that SRL-trained models are more efficient in their reasoning. "The benefits come from improved reasoning quality and structure, not verbosity," he said. "In terms of efficiency, SRL-trained models are roughly equivalent to the base model in token usage … While SRL is not designed to minimize inference cost, it achieves robust reasoning performance without increasing it."

For the math evaluation, the team trained Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and with RLVR (using the GRPO algorithm common in models such as DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model improved performance by 3.0% on average over the other methods.

The team then extended SRL to agentic software engineering, a critical domain for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with coding environments. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline trained with SFT. SRL achieved a 14.8% task-resolution rate, a 74% relative improvement over the SFT-based model. This demonstrates SRL’s ability to train more capable AI agents for complex, real-world programming tasks.

A New Standard for High-Stakes AI?

The paper’s strongest results come from combining the two methods: first using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL as pre-training and applied RLVR in post-training, they observed an average performance increase of 3.7%, demonstrating a powerful curriculum-learning strategy.

This raises the question of whether this could become a new blueprint for building specialized AI.

"We see SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum. This SRL-first approach not only stabilizes the subsequent SRL phase but also makes the reasoning more interpretable and generalizable, which is important for high-stakes applications."

Looking ahead, Hsu acknowledges that challenges remain in scaling this pipeline, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. Still, he is optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we think the next big leap will come from automating their generation and filtering – leveraging strong teacher models or even bootstrapping new data from improved student models themselves."
