Meta’s DreamGym framework trains AI agents in a simulated world to reduce the cost of reinforcement learning.

by SkillAiNest


Researchers at Meta, the University of Chicago, and UC Berkeley have developed a new framework that addresses the high costs, infrastructure complexity, and unreliable feedback associated with using reinforcement learning (RL) to train large language model (LLM) agents. The framework, DreamGym, simulates an RL environment to train agents for complex applications. As training progresses, the framework dynamically adjusts task difficulty, ensuring that the agent gradually learns to solve harder problems.

The research team’s experiments show that DreamGym significantly improves RL training in fully simulated settings and in scenarios where the model must apply its simulated learning to the real world. In settings where RL is feasible but expensive, it matches the performance of popular algorithms using only synthetic interactions, significantly reducing the costs of data collection and environment interaction.

This approach can be critical for enterprises, allowing them to train agents for bespoke applications while avoiding the complexities of setting up and running a live RL environment.

The challenge of training LLM agents

Reinforcement learning is a key technique for training LLMs to handle complex tasks in agentic environments, such as web navigation, tool use, and robotics. It allows the model to learn from direct interaction and experience, going beyond the static datasets used in pre-training.

However, RL is difficult to apply to agent training. Real-world applications often involve long action sequences with sparse rewards, meaning the agent receives a positive signal only after completing a long and accurate sequence of actions.
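The sparse-reward problem can be made concrete with a toy example (the task and action names here are illustrative, not from the paper): the agent receives exactly one positive signal, at the end, and only if every step was correct.

```python
# Tiny illustration of sparse rewards in a long-horizon agent task:
# zero feedback at every intermediate step, a single positive signal
# at the end, and only when the whole sequence was correct.

def episode_rewards(actions: list[str], correct_sequence: list[str]) -> list[float]:
    rewards = [0.0] * len(actions)
    if actions == correct_sequence:
        rewards[-1] = 1.0  # the only learning signal in the episode
    return rewards

correct = ["open site", "search", "select item", "checkout", "pay"]
print(episode_rewards(correct, correct))                   # [0.0, 0.0, 0.0, 0.0, 1.0]
print(episode_rewards(correct[:-1] + ["quit"], correct))   # all zeros: one wrong step, no signal
```

With rewards this sparse, most rollouts return nothing for the optimizer to learn from, which is exactly the inefficiency DreamGym's synthetic experience is meant to address.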

Collecting sufficiently diverse and validated data is also expensive, often requiring human experts to verify tasks and interpret results. The infrastructure required to run a live environment for large-scale RL training can be prohibitively complex and expensive. And interacting with live systems directly carries risks, as wrong actions (such as deleting a file) can cause irreparable damage.

“These limitations make building general-purpose and scalable systems for training agents with RL an open and pressing challenge,” the researchers write.

DreamGym directly challenges that model by delivering comparable performance in full simulation, removing the infrastructure burden that has prevented most enterprises from adopting RL, and giving teams a practical way to train agents without touching expensive or dangerous live environments.

How DreamGym works

The researchers describe DreamGym as “a unified and extensible RL framework that synthesizes diverse experience data in an online fashion to enable effective and efficient training of LLM agents.” It is built around three core components that work together to create a controlled and efficient training loop.

The first component is a “reasoning-based experience model” that translates the dynamics of the target environment into text space. This model acts as a simulator of the application environment. Instead of interacting with the expensive real environment, the agent interacts with the model, which generates consistent state transitions and feedback based on the agent’s actions.

The researchers argue that agent training does not require a perfectly realistic environment, but rather data that is sufficiently diverse, informative, and grounded. For example, in a web shopping task, the model synthesizes neat lists of on-page elements instead of processing raw HTML code. This abstraction makes the experience model highly efficient, requiring only a small amount of seed data.
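A minimal sketch of the idea: an LLM-style simulator maps an abstract text state plus an action to the next abstract state and a reward, so the agent never touches raw HTML. The stub transition function, page strings, and action names below are all hypothetical stand-ins for DreamGym's trained reasoning model.

```python
# Sketch of a reasoning-based experience model for a web-shopping task.
# States are neat textual summaries of on-page elements; the "LLM" is a
# hand-written stub standing in for the trained reasoning model.

def stub_experience_llm(state: str, action: str) -> tuple[str, float]:
    """Stand-in for the model that predicts state transitions and rewards."""
    if action == "click[buy]" and "item: red shoes" in state:
        return "page: order confirmed", 1.0   # task solved
    if action == "search[red shoes]":
        return "page: results | item: red shoes | item: blue shoes", 0.0
    return state, 0.0                          # unrecognized action: no-op

def step(state: str, action: str) -> tuple[str, float]:
    # The agent interacts with the simulator, not a live browser.
    return stub_experience_llm(state, action)

state = "page: home | searchbox"
state, r = step(state, "search[red shoes]")
state, r = step(state, "click[buy]")
print(state, r)  # page: order confirmed 1.0
```

The point of the abstraction is that the transition model only has to produce these compact summaries, which is far cheaper than rendering or parsing real pages.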

The second component is an “experience replay buffer,” which acts as dynamic memory. At the beginning of the training process, the buffer is seeded with offline data to provide the necessary context, and it is continuously updated with new synthetic trajectories generated during training. This buffer helps guide the predictions of the experience model, ensuring that simulations remain diverse and grounded in reality.
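The buffer's mechanics can be sketched in a few lines (class and method names here are illustrative, not from the paper): seed it with offline trajectories, append synthetic rollouts as training produces them, and sample from it to condition the experience model.

```python
# Illustrative sketch of the experience replay buffer: seeded with offline
# trajectories, continuously extended with synthetic rollouts, and sampled
# to keep the experience model's predictions diverse and grounded.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, seed_trajectories, capacity=10_000):
        # Offline seed data grounds the earliest simulations.
        self.buffer = deque(seed_trajectories, maxlen=capacity)

    def add(self, trajectory):
        # Synthetic trajectories generated during training land here too.
        self.buffer.append(trajectory)

    def sample(self, k):
        # Retrieved examples condition the experience model's next predictions.
        return random.sample(list(self.buffer), min(k, len(self.buffer)))

buf = ReplayBuffer(seed_trajectories=[("s0", "a0", "s1", 0.0)])
buf.add(("s1", "a1", "s2", 1.0))
print(len(buf.buffer))  # 2
```

The fixed `capacity` with `deque(maxlen=...)` is one simple way to keep memory bounded while the newest trajectories displace the oldest.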

The third component, a “curriculum task generator,” works in conjunction with the experience model to adaptively create new tasks that are progressively more difficult. It identifies tasks where the agent’s success rate is intermediate (indicating that they are difficult but solvable) and creates variations of them to push the agent’s capabilities further.
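The selection logic can be sketched as a filter over per-task success rates: keep tasks the agent sometimes, but not always, solves, then mutate them into harder variants. The thresholds, task names, and the `mutate` placeholder below are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of the curriculum task generator: target tasks with
# an intermediate success rate (hard but solvable) and spawn variations.

def select_frontier_tasks(success_rates: dict[str, float],
                          low: float = 0.2, high: float = 0.8) -> list[str]:
    """Tasks the agent sometimes solves: neither trivial nor hopeless."""
    return [t for t, rate in success_rates.items() if low <= rate <= high]

def mutate(task: str) -> str:
    # Placeholder for the LLM-driven variation step that makes a task harder.
    return task + " + extra constraint"

rates = {
    "buy shoes": 0.95,          # too easy: nothing left to learn
    "compare 3 laptops": 0.5,   # frontier: difficult but solvable
    "book multi-leg trip": 0.05 # too hard: almost no learning signal
}
frontier = select_frontier_tasks(rates)
new_tasks = [mutate(t) for t in frontier]
print(new_tasks)  # ['compare 3 laptops + extra constraint']
```

Filtering out both extremes is the standard curriculum-learning intuition: tasks that are always solved or never solved yield little gradient signal, so generation effort goes to the frontier.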

Together, these components form a closed-loop system for scalable agent training. “By combining interaction, memory, and adaptive online task generation, DreamGym addresses persistent challenges that have limited RL for training LLM agents: prohibitive cost, lack of diverse tasks, unstable reward signals, and heavy infrastructure demands,” according to the researchers.

DreamGym in action

The researchers evaluated DreamGym on several agent benchmarks, including WebShop (e-commerce), ALFWorld (embodied control), and WebArena (realistic web interaction). They used Llama 3 and Qwen 2.5 models as agent backbones and compared DreamGym against several conventional training strategies. These include offline methods such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) as well as online RL algorithms such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which optimize agents through direct interaction with the environment.

DreamGym shows its most significant advantage in environments like WebArena, where large-scale RL infrastructure is difficult to set up. Agents trained fully within DreamGym achieved over 30% higher success rates than baseline methods, which struggle with sparse rewards and limited exploration in real environments. This demonstrates that DreamGym makes RL training “viable in domains that were previously complicated by inherent task and engineering constraints,” the researchers said.

In environments where RL is feasible but expensive, agents trained with DreamGym performed on par with agents trained using GRPO and PPO, but without any costly interaction with the external environment. The team also introduced a sim-to-real approach, DreamGym-S2R, in which an agent is first trained in the simulated environment and then fine-tuned on a small amount of real-world data. This strategy achieved more than a 40% performance improvement over training from scratch in the real environment while using less than 10% of the external data. It provides a scalable “warm start” for training general-purpose agents.
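The two-stage recipe described above can be sketched as a pipeline (the stage functions and budget numbers below are hypothetical stubs, not the paper's implementation): a long, cheap synthetic-training phase followed by a short real-data fine-tune.

```python
# Illustrative sketch of the sim-to-real (S2R) recipe: train fully in the
# synthetic environment first, then fine-tune on a small real-data budget.
# Both stage functions are placeholders that just record their budgets.

def train_in_simulation(agent: dict, sim_steps: int) -> dict:
    agent["sim_steps"] = sim_steps          # stage 1: cheap synthetic rollouts
    return agent

def finetune_on_real(agent: dict, real_episodes: int) -> dict:
    agent["real_episodes"] = real_episodes  # stage 2: small real-world budget
    return agent

agent: dict = {}
agent = train_in_simulation(agent, sim_steps=100_000)
# A fraction of the real-data budget a from-scratch run would need:
agent = finetune_on_real(agent, real_episodes=500)
print(agent)  # {'sim_steps': 100000, 'real_episodes': 500}
```

The design point is the asymmetry of the budgets: almost all interaction happens in the simulator, and the expensive real environment is touched only for the final fine-tuning pass.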

Finally, the framework demonstrated strong generalization. An agent trained on tasks in one domain, such as WebShop, could successfully transfer its learned skills to another, such as WebArena. The researchers suggest that this is because DreamGym agents learn in an abstract meta-representation space, picking up domain-agnostic behaviors rather than memorizing task-specific patterns.

While still in its early stages, DreamGym shows that synthetic environments can provide substantial benefits in training agents. In practice, an enterprise could gather a small amount of data describing the processes and tasks it wants to automate, then use that seed to bootstrap the DreamGym framework for scalable and sample-efficient agent training.
