
Researchers at Alibaba’s Tongyi Lab have developed a new framework for self-evolving agents that create their own training data by exploring their application environment. The framework, AgentEvolver, uses the knowledge and reasoning capabilities of large language models for autonomous learning, reducing the high cost and manual effort typically required to collect task-specific datasets.
Experiments show that, compared to traditional reinforcement learning-based frameworks, AgentEvolver explores its environment more efficiently, makes better use of data, and adapts quickly to new application environments. For enterprises, this matters because it lowers the barrier to training agents for bespoke applications, making powerful, custom AI assistants accessible to a wider range of organizations.
High cost of training AI agents
Reinforcement learning has become an important paradigm for training LLMs to act as agents that can interact with digital environments and learn from feedback. However, developing agents with RL faces fundamental challenges. First, collecting the necessary training datasets is often prohibitively expensive, requiring significant manual labor to create task examples, especially in novel or proprietary software environments where off-the-shelf datasets are not available.
Second, the RL techniques commonly used for LLMs require running the model through a large number of trial-and-error attempts to learn effectively. This process is computationally expensive and inefficient. Consequently, training capable LLM agents with RL remains laborious and expensive, limiting their deployment in custom enterprise settings.
How AgentEvolver works
The main idea behind AgentEvolver is to give the model as much autonomy as possible in its own learning process. The researchers describe it as a “self-evolving agent system” that is “designed to achieve autonomous and efficient capability evolution through environmental interaction.” It uses the reasoning power of the LLM to create a self-training loop, which allows the agent to continuously improve by interacting directly with its target environment, without the need for predefined tasks or reward functions.
“We envision an agent system where the LLM actively guides exploration, task generation, and performance refinement,” the researchers wrote in their paper.
The process of self-evolution is driven by three basic mechanisms that work together.
The first is self-questioning, where the agent explores its environment to discover the limits of its functions and identify useful states. It is like a new user clicking around an application to see what’s possible. Based on this exploration, the agent creates its own set of diverse tasks that are tailored to the user’s general preferences. This reduces the need for hand-crafted datasets and allows the agent and its tasks to co-evolve, enabling it to tackle progressively more complex challenges.
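The paper does not publish pseudocode for this loop, but the general idea can be sketched roughly as follows. This is a minimal illustrative sketch, not AgentEvolver’s actual API: the toy environment, the `llm_propose_tasks` stub (standing in for a real LLM call), and all names are assumptions.

```python
import random

def llm_propose_tasks(observations, preferences, n=3):
    """Placeholder for an LLM call that turns raw observations of the
    environment into candidate training tasks. Here we just template them."""
    return [f"Task {i}: use '{obs}' to satisfy '{preferences}'"
            for i, obs in enumerate(observations[:n])]

def explore(environment, steps=5):
    """Randomly probe the environment to discover available actions."""
    return [random.choice(environment["actions"]) for _ in range(steps)]

# Toy environment: a list of API-like actions the agent can discover.
env = {"actions": ["search_orders", "update_inventory", "send_invoice"]}

observations = explore(env)
tasks = llm_propose_tasks(observations, preferences="automate billing")
for task in tasks:
    print(task)
```

The key property is that the agent, not a human annotator, produces the task list: exploration feeds observations to the model, and the model turns them into a training curriculum.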
According to Yunpeng Zhai, an Alibaba researcher and co-author of the paper, who spoke to VentureBeat, the self-questioning mechanism effectively “transforms the model from a data consumer to a data producer,” dramatically reducing the time and cost required to deploy an agent in a proprietary environment.
The second mechanism, self-navigating, improves exploration efficiency by reusing and generalizing from past experiences. AgentEvolver draws insights from both successful and failed attempts and uses them to guide future actions. For example, if an agent tries to use an API function that does not exist in an application, it records that as an experience and learns to verify that functions exist before trying to use them in the future.
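As a rough illustration of experience reuse, here is a hypothetical sketch (not the paper’s implementation; the class and all identifiers are invented for this example): failed attempts are recorded as lessons and consulted before the same action is tried again.

```python
class ExperienceMemory:
    """Minimal sketch of experience reuse: store outcomes of past
    attempts and consult them before acting again."""

    def __init__(self):
        self.lessons = {}  # action -> {"success": bool, "note": str}

    def record(self, action, success, note):
        self.lessons[action] = {"success": success, "note": note}

    def advise(self, action):
        past = self.lessons.get(action)
        if past and not past["success"]:
            return f"avoid or verify first: {past['note']}"
        return "no prior issues"

memory = ExperienceMemory()
# The agent once called an API endpoint that doesn't exist and failed:
memory.record("call:get_refund_status", success=False,
              note="endpoint not found; check API list before calling")

print(memory.advise("call:get_refund_status"))
print(memory.advise("call:search_orders"))
```

In the real system the lessons would be generalized by the LLM rather than looked up verbatim, but the effect is the same: past failures steer future exploration away from dead ends.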
The third mechanism, self-attributing, enhances learning efficiency by providing more detailed feedback. Instead of just a final success-or-failure signal (a common practice in RL that can result in sparse rewards), this method uses an LLM to evaluate the contribution of each individual action in a multi-step task. It determines whether each step contributed positively or negatively to the final outcome, giving the agent fine-grained feedback that accelerates learning.
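The contrast with sparse rewards can be sketched in a few lines. This is an illustrative toy, not the paper’s reward model: `llm_judge_step` is a stand-in for an LLM-based judge, and the scoring values are arbitrary.

```python
def llm_judge_step(step):
    """Stands in for an LLM call that evaluates whether a single action
    moved the task forward (a crude heuristic here)."""
    return 1.0 if step["helped"] else -0.5

def attribute_rewards(trajectory):
    """Dense, per-step rewards instead of one sparse terminal signal."""
    return [llm_judge_step(step) for step in trajectory]

trajectory = [
    {"action": "list_available_apis", "helped": True},
    {"action": "call_missing_endpoint", "helped": False},
    {"action": "call_correct_endpoint", "helped": True},
]
print(attribute_rewards(trajectory))  # [1.0, -0.5, 1.0]
```

With a single terminal reward, all three steps would receive the same signal; per-step attribution tells the agent specifically which action hurt, so it learns from one trajectory what sparse rewards would need many to teach.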
This is especially important for regulated industries, where how an agent solves a problem matters as much as whether it solves it. “Instead of rewarding a student just for the final answer, we also evaluate the clarity and correctness of each step in their reasoning,” Zhai explained. This improves transparency and encourages agents to adopt more robust and auditable problem-solving patterns.
According to the researchers, “By shifting the training initiative from human-engineered pipelines to LLM-guided self-improvement, AgentEvolver establishes a new paradigm that paves the way for scalable, cost-effective, and continuously improving intelligent systems.”
The team has also developed a practical, end-to-end training framework that integrates these three mechanisms. An important part of this foundation is the Context Manager, a component that controls the agent’s memory and interaction history. Although today’s benchmarks test a limited number of tools, real enterprise environments can include thousands of APIs.
Zhai acknowledges that this is a fundamental challenge for the field, but notes that AgentEvolver was designed with this scaling problem in mind. “Retrieval over very large action spaces will always introduce computational challenges, but AgentEvolver’s architecture provides a clear path toward scalable tool reasoning in enterprise settings,” he said.
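The article does not detail how retrieval over thousands of APIs would work, but the general idea of narrowing a large tool catalogue down to a relevant handful can be sketched with a toy keyword-overlap ranking. This is purely illustrative: a production system would likely use embedding similarity, and all tool names below are hypothetical.

```python
def retrieve_tools(query, tool_descriptions, k=2):
    """Rank tools by word overlap between the task description and each
    tool's description, returning the top-k tool names."""
    query_words = set(query.lower().split())
    scored = sorted(
        tool_descriptions.items(),
        key=lambda item: len(query_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical tool catalogue (a real one might hold thousands of entries).
tools = {
    "create_invoice": "create and send an invoice to a customer",
    "refund_order": "issue a refund for a customer order",
    "rotate_api_key": "rotate an api key for a service account",
}
print(retrieve_tools("send the customer an invoice", tools))
```

Only the shortlisted tools would then be placed in the model’s context, keeping the prompt small no matter how large the full catalogue grows.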
A more efficient way to train agents
To measure the effectiveness of their framework, the researchers tested it on AppWorld and BFCL v3, two benchmarks that require agents to perform long, multi-step tasks using external tools. They used models from Alibaba’s Qwen2.5 family and DeepSeek-R1.
The results show that integrating all three mechanisms in AgentEvolver led to significant performance gains. For the 7B model, the average score improved by 29.4%, and for the 14B model it increased by 27.8% compared to baselines. The framework consistently improved the models’ reasoning and task-execution capabilities on both benchmarks. The biggest improvement came from the self-questioning module, which autonomously generates diverse training tasks and directly addresses the problem of data scarcity.
The experiments also showed that AgentEvolver can efficiently synthesize a large amount of high-quality training data. The tasks generated by the self-questioning module proved diverse enough to achieve good training performance even with very little data.
For enterprises, this provides a way to create agents for bespoke applications and internal workflows while minimizing the need for manual data annotation. By providing high-level goals and allowing agents to create their own training experiences, organizations can develop custom AI assistants more easily and cost-effectively.
The researchers conclude, “This combination of algorithmic design and engineering practicality positions AgentEvolver as both a research vehicle and a reusable foundation for building adaptive, tool-using agents.”
Looking ahead, the ultimate goal is ambitious. “A truly ‘general’ model that can be dropped into any software environment and master it overnight is surely the holy grail of agentic AI,” said Zhai. “We see AgentEvolver as an essential step in that direction.” While that future still requires breakthroughs in model reasoning and infrastructure, self-evolving approaches are paving the way.