The remarkable success of OpenAI's o1 series and DeepSeek-R1 has clearly demonstrated the power of reinforcement learning (RL) to elicit sophisticated reasoning behaviors and significantly enhance the capabilities of large language models (LLMs).

However, the core training methods behind these cutting-edge reasoning models are often left undisclosed in their technical reports. Recent community efforts have concentrated primarily on mathematical reasoning, leaving cross-domain generalization an open challenge. Furthermore, standard Group Relative Policy Optimization (GRPO) training suffers from common issues such as performance plateaus, inefficient sample usage, and difficulty in handling mixed-domain datasets. These challenges complicate the effective scaling of RL methods for LLMs.
To address these limitations, researchers from Kuaishou's Kwaipilot team have introduced a novel reinforcement learning framework: Two-Staged history-Resampling Policy Optimization (SRPO). This approach is designed to tackle the aforementioned training challenges across multiple dimensions. The team has publicly released a technical report detailing their training method and has also open-sourced the SRPO-Qwen-32B model.

Notably, this work represents the first instance of achieving DeepSeek-R1-Zero-level performance simultaneously in both the math and code domains. Using the same base model as DeepSeek (Qwen2.5-32B) and a purely reinforcement-learning-based training recipe, SRPO achieves impressive results on the AIME24 (50) and LiveCodeBench (41.6) benchmarks, surpassing DeepSeek-R1-Zero-32B.

Even more remarkable, SRPO reaches this level of performance with only one-tenth of the training steps required by R1-Zero.
Challenges with Vanilla GRPO
In their preliminary experiments, the Kwaipilot team trained with the standard GRPO algorithm, but quickly hit bottlenecks that kept the model from reaching the desired R1-Zero-level performance. These issues included:
- Cross-domain optimization conflicts (math vs. code): Math problems tend to elicit longer, more detailed reasoning trajectories (long CoT), while code data exhibits a much weaker tendency toward this. Directly mixing the two data types caused conflicts, resulting in suboptimal performance in both domains.
- Identical group rewards reduce training efficiency: GRPO relies on the variance of rewards within a sampled group to compute the advantage. When the rollouts in a group receive nearly identical reward values, the computed advantage approaches zero. If a large portion of the training batch exhibits this behavior, effective gradient contributions become minimal, sharply reducing training efficiency (see the sketch after this list).
- Early performance saturation: GRPO training hit an early performance plateau and reward saturation on benchmark evaluations. The issue was partly attributed to insufficient data quality: when the training data lacks complexity or diversity, particularly when simple problems are abundant, the model tends to conservatively maintain its performance on easy tasks and lacks exposure to the challenging problems needed to develop complex, deep reasoning.
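To make the second issue concrete, here is a minimal sketch in the spirit of GRPO's group-relative advantage (not the team's actual implementation), showing how the advantage collapses to zero when every rollout in a group receives the same reward:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantage: normalize each rollout's reward
    by the mean and standard deviation of its sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group with diverse outcomes yields informative, non-zero advantages.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [ 1. -1.  1. -1.]

# A group where every rollout earns the same reward collapses to ~zero advantage,
# contributing essentially no gradient signal for that prompt.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0. 0. 0. 0.]
```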
Two-staged training
To resolve the response-length conflict between the math and code domains, the Kwaipilot team implemented a two-staged training paradigm:
- Stage 1: Eliciting reasoning capabilities: This initial training phase focuses exclusively on challenging math data. The main goal is to fully stimulate the model's test-time scaling, fostering abilities such as reflection, backtracking, and step-by-step decomposition.
- Stage 2: Skill integration: In this stage, code data is introduced into the training process. Building on the reasoning foundation established in Stage 1, this phase aims to further enhance coding capabilities while progressively strengthening procedural, iterative, and tool-calling abilities (a minimal schedule sketch follows this list).
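A staged data schedule of this kind could look roughly like the sketch below; the batch size, stage boundary, and mixing ratio are placeholders, not values reported by the Kwaipilot team:

```python
import random

def build_training_batch(math_pool, code_pool, step,
                         stage_boundary=1000, batch_size=32, code_ratio=0.5):
    """Hypothetical two-staged data schedule.

    Stage 1: sample challenging math problems only, to elicit long-CoT reasoning.
    Stage 2: blend in code problems to integrate coding skills on top of the
             reasoning foundation built in Stage 1.
    """
    if step < stage_boundary:
        # Stage 1: math only
        return random.sample(math_pool, k=min(batch_size, len(math_pool)))
    # Stage 2: mixed math and code
    n_code = int(batch_size * code_ratio)
    batch = random.sample(code_pool, k=min(n_code, len(code_pool)))
    batch += random.sample(math_pool, k=min(batch_size - len(batch), len(math_pool)))
    random.shuffle(batch)
    return batch
```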
Comparative analysis of training strategies
The effects of different training data strategies on response length were analyzed, revealing the following insights:
- Mixed training: Models trained on a direct mixture of math and code data showed limited growth in response length and poor benchmark performance. Although math problems elicited some reasoning patterns, code problems tended to produce short, direct responses focused on immediate code output with little preliminary analysis or planning.
- Math-only training: Training exclusively on math data led to substantial growth in response length and excellent performance on math benchmarks. Notably, it fostered strong, generalizable reasoning: when faced with programming tasks, the model attempted detailed, step-by-step reasoning, including careful checking and re-verification steps, much as it would for a math problem.
- Code-only training: While performance on code benchmarks improved, explicit reasoning behaviors rarely emerged and response length was difficult to increase. Responses to both code and math problems were noticeably shorter than under math-only training, with code solutions often produced directly, without step-by-step reasoning or preliminary analysis.
- Staged training: The two-staged approach proposed by the Kwaipilot team yielded strong results in both the math and programming domains. The model produced structured reasoning patterns for math problems and detailed, step-by-step procedures for programming tasks. Notably, complex behaviors emerged, such as the model spontaneously using code to assist its mathematical reasoning.
History Resampling
The Kwaipilot team observed that during the middle and late stages of training, roughly 50% of the sampled groups in a batch produced identical rewards. This typically happened when the model consistently succeeded on easy problems, leading to near-zero advantages and ineffective gradient updates.
To eliminate this inefficiency and improve the quality of the gradient signal, they introduced History Resampling. During training, they recorded the reward outcomes of all rollouts in each epoch. At the end of an epoch, they rebuilt the dataset for the next epoch according to the following criteria:
- Filter out overly easy samples: Samples for which every rollout was correct were removed, since they provide no informative signal for policy improvement.
- Retain informative samples: Samples with diverse outcomes (both correct and incorrect) or with all-incorrect outcomes were retained, since they yield non-zero advantages and thus effective gradient signals. Hard samples for which all rollouts in the current epoch were incorrect were also kept: the rationale is that such initially difficult problems may become relatively easier for the updated policy, producing useful gradients in later training. This strategy is in line with curriculum learning, gradually exposing the model to increasingly challenging samples to improve training efficiency. A minimal filtering sketch follows this list.
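The following is a minimal sketch of this kind of epoch-end filter, assuming binary (0/1) rollout rewards; it illustrates the selection rule only and does not reproduce the team's actual bookkeeping:

```python
def history_resample(sample_records):
    """Keep samples whose rollouts were not all correct in the finished epoch.

    `sample_records` maps each sample id to the list of 0/1 rewards its
    rollouts received during that epoch.
    """
    next_epoch = []
    for sample_id, rewards in sample_records.items():
        if all(r == 1 for r in rewards):
            continue  # too easy: every rollout already solves it, no gradient signal
        next_epoch.append(sample_id)  # mixed outcomes or currently too hard: keep
    return next_epoch

# Sample "a" (all correct) is dropped; "b" (mixed) and "c" (all wrong) are kept.
print(history_resample({"a": [1, 1, 1, 1], "b": [1, 0, 0, 1], "c": [0, 0, 0, 0]}))
```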
Compared with the dynamic sampling method proposed in DAPO, History Resampling significantly improves computational efficiency and yields more stable growth in response length.
Data
The Kwaipilot team carefully cleaned and filtered publicly available code and math datasets. They applied heuristic rules to filter out irrelevant URLs and formatting noise, and to ensure that the required fields (question and ground-truth answer) were complete in the original data. After this initial cleaning, they further removed math problems that were multi-part, purely proof-based, or required image or table understanding. For code data, they removed problems that depend on a specific environment, file I/O, or network interaction, keeping the focus on algorithmic logic.
Before feeding the data into training, they verified both the math and code problems to ensure the correctness and solvability of the answers, discarding those with incorrect or ambiguous solutions. They then assessed the difficulty of each problem, classifying it as easy, medium, or hard based on its pass rate (Pass@k).
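As a simple illustration of this kind of difficulty bucketing (the thresholds and rollout count below are placeholders, not the values used by the Kwaipilot team):

```python
def pass_rate(results):
    """Empirical pass rate over k rollouts, each scored 1 (correct) or 0."""
    return sum(results) / len(results)

def classify_difficulty(rate, easy_threshold=0.8, hard_threshold=0.2):
    """Bucket a problem into easy / medium / hard by its empirical pass rate."""
    if rate >= easy_threshold:
        return "easy"
    if rate <= hard_threshold:
        return "hard"
    return "medium"

# Example: 2 correct rollouts out of 8 -> pass rate 0.25 -> "medium".
print(classify_difficulty(pass_rate([1, 0, 0, 0, 1, 0, 0, 0])))
```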
Experimental results
This section describes the experimental results obtained with the SRPO method. The Kwaipilot team focused on how the reward and the response-length metrics evolved during training.
Training process
The figure above shows the complete reward curve and response-length curve during SRPO training. After the initial phase of reward growth, training entered the second stage. At the start of the second stage, the overall reward dropped because the model had not previously been trained on code, and it then rose steadily over the rest of training. Incorporating code data did not significantly increase response length, which was in line with expectations. Meanwhile, benchmark results showed steady, stable improvement in both the model's math and coding abilities, demonstrating the effectiveness of the new method.
In particular, History Resampling kept gradient updates effective at every training step, directly increasing the proportion of informative gradients. This improved sample efficiency led to stable reward growth, clearly demonstrating the benefit of the resampling strategy.
Emergence of reasoning behaviors
The Kwaipilot team identified three representative reflection patterns: rechecking, hesitation, and exploration. They annotated responses containing these patterns and recorded the average response length for each. Over the course of RL training, they observed a gradual increase in the frequency of the model's self-reflection, correction, and backtracking, indicating the emergence of a "self-verification" ability. They argue that the emergence of "reflection" during RL, analogous to human deliberation, is an adaptive behavior produced by the policy optimization process.
As shown in the figure above, in the early stages of training the model showed almost no proactive checking or reflection on its earlier reasoning steps. As training progressed, however, the model exhibited pronounced reflection and backtracking behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, stepwise verification, and self-correction.
Interestingly, they also found that the model learned to use program code for verification when solving math problems: it would first give a solution through mathematical reasoning and then proactively write code to check its correctness. These cases demonstrate the model's ability to leverage code for self-checking and to make multiple attempts at a problem, further indicating that in the later training stages it had mastered the integrated use of broad reasoning and code-based verification for problem solving.
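As an illustration of how such reflection behaviors can be tracked, the sketch below counts responses containing characteristic phrases for each pattern and records their average length; the keyword lists are hypothetical and not those used by the Kwaipilot team:

```python
# Hypothetical keyword lists for the three reflection patterns.
REFLECTION_PATTERNS = {
    "recheck": ["let me recheck", "double-check", "verify again"],
    "hesitation": ["wait", "hmm", "on second thought"],
    "exploration": ["alternatively", "another approach", "let me try"],
}

def reflection_stats(responses):
    """Count responses containing each pattern and their average length in words."""
    stats = {}
    for name, keywords in REFLECTION_PATTERNS.items():
        hits = [r for r in responses if any(k in r.lower() for k in keywords)]
        avg_len = sum(len(r.split()) for r in hits) / len(hits) if hits else 0.0
        stats[name] = {"count": len(hits), "avg_len_words": avg_len}
    return stats

print(reflection_stats([
    "Wait, let me recheck the algebra before concluding.",
    "The answer is 42.",
]))
```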
Paper: SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM, available on arXiv
Try the SRPO-Qwen-32B model on Hugging Face