Beyond math and coding: New RL framework helps train LLM agents for complex, real-world tasks

by SkillAiNest

Researchers at the University of Science and Technology of China have developed a new reinforcement learning (RL) framework that helps train large language models (LLMs) for complex agent tasks beyond well-defined problems such as math and coding.

Their framework, Agent-R1, is compatible with popular RL algorithms and shows considerable improvement on reasoning tasks that require multiple retrieval steps and multi-turn interactions with tools.

This framework is built on a redefinition of the RL paradigm that takes into account the dynamic nature of agent applications that require interactions with evolving environments and incomplete information. This structure is more similar to real-world applications and can be of significant use for agent tasks in enterprise settings.

Rethinking reinforcement learning for agents

RL has become the cornerstone of LLM training for well-defined reasoning tasks. In areas like math and coding, the model gets a clear signal: the answer is either right or wrong. This makes it relatively straightforward to reward or punish its behavior.

But this approach struggles with agentic tasks that require models to operate in interactive environments, develop dynamic memories in conversation, perform multilevel reasoning, and respond to unexpected feedback. Training agents with RL for these scenarios presents unique challenges, especially in multi-turn interactions where designing effective rewards is complex and the trained agent often fails to generalize to the messy, unpredictable nature of real-world environments.

To address these challenges, researchers at the University of Science and Technology of China revised the core formalism underlying RL, the Markov decision process (MDP). An MDP models decision-making with four main components: a state space (the set of possible states an agent can be in); an action space (what the agent can do); a state-transition probability (which state an action leads to); and a reward function (whether an outcome is good or bad). The paper proposes extending this framework to optimize LLM agents.

In the new formulation, the state space is expanded to include not only the current state (the sequence of tokens the model has generated) but also the entire history of interactions and environmental feedback. Actions are still primarily about generating text, but specific spans of text can now trigger external tools such as an API call. State transitions become unpredictable, or "stochastic," because the outcome depends not only on the tokens the model predicts but also on the environment's response, which hinges on external factors. Finally, the reward system becomes more granular, involving intermediate "action rewards" for successfully completing steps along the way, rather than just a single final reward. This gives the agent more frequent and precise guidance during training.
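The extended formulation can be sketched schematically. This is a minimal illustration of the four extended components, not the paper's implementation; all names here (`AgentState`, `transition`, `action_reward`) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Extended state: not just the current token sequence, but the
    full history of interactions and environment feedback."""
    tokens: list = field(default_factory=list)    # text generated so far
    history: list = field(default_factory=list)   # (action, feedback) pairs

def transition(state, action, environment):
    """Stochastic transition: the next state depends on the environment's
    response, not only on the tokens the model emitted."""
    feedback = environment(action)                # external, possibly random
    state.tokens.append(action)
    state.history.append((action, feedback))
    return state, feedback

def action_reward(feedback):
    """Intermediate 'action reward': partial credit for one completed
    step, instead of a single signal at the end of the episode."""
    return 1.0 if feedback.get("success") else 0.0
```

Here the environment is any callable that maps an action (e.g., a tool-triggering string) to feedback, which is what makes the transition stochastic from the agent's point of view.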

This last bit is particularly important and solves the “sparse reward” problem that most RL frameworks face. When the agent receives a single reward signal based on the final outcome, it does not learn from the correct and incorrect intermediate actions it has taken along the way. Action rewards solve this problem by providing feedback signals at these intermediate stages, making the learning process more efficient.
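A toy comparison makes the difference concrete. The numbers and weighting below are hypothetical, chosen only to illustrate why dense action rewards give a learning signal where a sparse terminal reward gives none:

```python
def sparse_return(step_outcomes):
    """Single terminal reward: 1 only if the final answer is correct."""
    return 1.0 if step_outcomes and step_outcomes[-1] else 0.0

def dense_return(step_outcomes, step_weight=0.2):
    """Action rewards: partial credit for each correct intermediate
    step plus the terminal reward, so a trajectory that got the first
    steps right still produces a nonzero learning signal."""
    intermediate = sum(step_weight for ok in step_outcomes[:-1] if ok)
    terminal = 1.0 if step_outcomes and step_outcomes[-1] else 0.0
    return intermediate + terminal
```

For a trajectory that retrieves the right documents twice but gives a wrong final answer (`[True, True, False]`), the sparse return is 0 while the dense return is positive, so the agent still learns which intermediate actions were useful.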

“This extension is critical for enabling reinforcement learning algorithms to train sophisticated agents capable of complex, multi-factor reasoning and interaction in dynamic environments,” the researchers write in their paper.

Agent-R1 Framework

Based on the extended MDP definition, the researchers developed Agent-R1, a flexible and user-friendly training platform for RL-based LLM agents. It extends traditional single-turn RL frameworks to handle the multi-turn, interactive nature of agentic tasks, allowing seamless integration with diverse environments.

The most important difference lies in the "rollout phase," where the agent produces its response. In single-turn RL, the model generates a response once. In multi-turn RL, this process involves a series of back-and-forth interactions.
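The contrast can be sketched as two loops. This is a simplified sketch, not Agent-R1's actual rollout code; `model`, `env`, and the `"TOOL:"` prefix convention are hypothetical stand-ins:

```python
def single_turn_rollout(model, prompt):
    """Classic RL for LLMs: one generation, then the episode ends."""
    return [model(prompt)]

def multi_turn_rollout(model, env, prompt, max_turns=5):
    """Agentic rollout: a generation may trigger a tool, and the tool's
    output is fed back into the context for the next turn."""
    context = prompt
    trajectory = []
    for _ in range(max_turns):
        action = model(context)
        trajectory.append(action)
        if not action.startswith("TOOL:"):   # plain text = final answer
            break
        observation = env(action)            # execute the tool call
        context += f"\n{action}\n{observation}"
    return trajectory
```

The rollout ends either when the model produces a final answer or when the turn budget is exhausted, and the whole trajectory, tool calls included, becomes training data.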

Agent-R1 achieves this flexible multi-turn rollout with two core modules: Tool and ToolEnv. A Tool module acts as an executor for specific actions, such as calling an API or accessing a database. When invoked, a Tool executes its process and returns a direct, raw result. In contrast, ToolEnv is the orchestrator and interpreter. It takes the output from the Tool and determines how that output affects the agent's state and the overall progress of the task. ToolEnv manages state transitions, calculates reward signals based on tool results, and packages the new state information for the agent.
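The division of labor between the two modules can be sketched as follows. This is an illustrative sketch of the executor/orchestrator split as described above, not Agent-R1's actual classes; the search corpus and reward rule are hypothetical:

```python
class SearchTool:
    """Executor: runs the action and returns only the raw result."""
    def __init__(self, corpus):
        self.corpus = corpus

    def execute(self, query):
        # "What happened": raw retrieval, no interpretation.
        return [doc for doc in self.corpus if query in doc]

class ToolEnv:
    """Orchestrator: interprets the raw tool output, updates the agent's
    state, and converts the result into a reward signal."""
    def __init__(self, tool):
        self.tool = tool
        self.state = {"history": []}

    def step(self, query):
        raw = self.tool.execute(query)        # what happened
        reward = 1.0 if raw else 0.0          # what it means for the task
        self.state["history"].append((query, raw))
        return self.state, reward
```

Keeping the Tool oblivious to rewards and state is what makes the design pluggable: a new tool only has to report results, while ToolEnv decides how those results score.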

In short, when an action is completed, the Tool reports "what happened," while ToolEnv determines "what this result means for the agent and the task."

Agent-R1 in action

The researchers tested Agent-R1 on the challenging task of multi-hop question answering, which requires complex reasoning, information retrieval and multi-step decision-making across multiple documents. They trained Qwen2.5-3B-Instruct on multi-hop QA data and evaluated its performance on the HotpotQA and 2WikiMultiHopQA datasets. They also tested it on the MuSiQue dataset, which was outside the domain of the agent's training tasks.

They compared different RL algorithms trained with Agent-R1 against two baselines: naive RAG, a single-pass retrieval method in which an LLM returns an answer based on a set of retrieved documents, and base tool calling, which uses the model's native function-calling capability without specialized RL training.

The results showed that all RL-trained agents significantly outperformed the baselines. GRPO, an RL algorithm used in modern reasoning models such as DeepSeek-R1, delivered the best performance.

“These results strongly validate the utility of Agent-R1 in training LLM agents powered by end-to-end RL, showing consistent, substantial gains over baselines across diverse datasets and RL algorithms,” the researchers write.

These findings may be important for the enterprise, where there is a strong push to apply RL and reasoning beyond well-defined domains. A framework built to handle messy, multi-turn interactions with users and dynamic environments could pave the way for new agents capable of solving complex problems in real-world settings.

“We hope that Agent-R1 provides the foundation for future work on extensible and unified RL training for Agentic LLM,” the researchers conclude.
