Google’s ‘Watch and Learn’ Framework Breaks the Data Bottleneck for Training Computer-Use Agents

by SkillAiNest

A new framework developed by researchers at Google Cloud and DeepMind aims to solve one of the key challenges in developing computer-use agents (CUAs): collecting high-quality training examples at scale.

The framework, dubbed Watch & Learn (W&L), converts publicly available videos, such as YouTube tutorials, into annotated trajectories for training CUAs without manual labeling.

Their experiments show that the resulting data can be used to train or fine-tune existing computer-use and foundation models, improving their performance on computer-use tasks. Just as important, the same approach can be used to create in-context learning examples for general-purpose agents.

The CUA data bottleneck

The web is rich with video tutorials and screencasts that walk through complex application workflows. These videos are a gold mine: they could provide computer-use agents with the domain knowledge and user-interface interactions needed to accomplish various tasks.

However, before they can be used to train CUAs, these videos need to be converted into annotated trajectories (i.e., collections of task descriptions, screenshots, and actions), a process that is prohibitively expensive and time-consuming when performed manually.

Current approaches to overcoming this data constraint rely on multimodal language models to interpret these videos, which typically results in low-accuracy, noisy annotations. A different approach uses self-exploring agents that autonomously probe user interfaces to collect trajectories; however, these techniques typically produce simple examples that are of little use in unpredictable real-world situations.

As the researchers note in their paper, “Overall, these approaches either rely on brittle heuristics, are costly as they rely on explorations in real environments, or produce demonstrations of low complexity that mismatch human intent.”

Watch and Learn

The Watch and Learn framework attempts to address the challenges of generating CUA demonstrations by rethinking the problem formulation.

Instead of generating trajectories directly or relying on complex multi-stage pipelines, the researchers frame the problem as an “inverse dynamics objective”: given two consecutive observations, predict the intermediate action that produced the transition.

According to the researchers, this formulation is “easy to learn, avoids hand-crafted heuristics and generalizes robustly across applications.”

The W&L framework can be broken down into three main steps: training an inverse dynamics model (IDM), retrieving raw videos, and training CUA agents.

In the first phase, the researchers used agents to interact directly with live web pages to create a large corpus of 500,000 state transitions (two consecutive observations and the action that produced the transition). They then used this data (along with 132,000 human-annotated transitions from existing open datasets) to train an inverse dynamics model (IDM) that takes in two consecutive observations and predicts the transition action. Their trained IDM, a small transformer model, outperformed off-the-shelf foundation models at predicting transition actions.
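For a rough sense of what this setup looks like in code, here is a minimal PyTorch sketch of an inverse dynamics model and its training objective. This is not the paper’s actual architecture: the class, the encoder, and the discrete action head are illustrative assumptions (real CUA actions also carry arguments such as click coordinates or typed text).

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Toy inverse-dynamics model: predict the action that turned
    observation o_t into observation o_{t+1}."""

    def __init__(self, obs_encoder: nn.Module, embed_dim: int, num_actions: int):
        super().__init__()
        self.obs_encoder = obs_encoder          # e.g. a small vision transformer (assumption)
        self.action_head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, num_actions),  # action types only (click, scroll, type, ...)
        )

    def forward(self, obs_before, obs_after):
        # Encode both screenshots and predict the intermediate action from the pair.
        z_before = self.obs_encoder(obs_before)
        z_after = self.obs_encoder(obs_after)
        return self.action_head(torch.cat([z_before, z_after], dim=-1))

def training_step(model, batch, loss_fn=nn.CrossEntropyLoss()):
    # batch: (o_t, o_{t+1}, a_t) triples drawn from the transition corpus
    obs_before, obs_after, action = batch
    logits = model(obs_before, obs_after)
    return loss_fn(logits, action)
```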

The researchers then designed a pipeline that retrieves videos from platforms like YouTube and runs them through the IDM to produce high-quality trajectories. The IDM takes in consecutive video frames and determines the actions (scroll, click, etc.) that caused the changes in the environment, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with high-accuracy action labels.
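A simplified sketch of that labeling step might look like the following; the `label_video` helper and the `idm.predict` call are hypothetical names used for illustration, not APIs from the paper, and the frame-sampling rate is an arbitrary assumption.

```python
import cv2  # OpenCV, used here to read frames from a downloaded screencast

def label_video(video_path, idm, frame_stride=30):
    """Turn a raw screencast into a trajectory of (observation, action) steps
    by letting the inverse dynamics model label consecutive frame pairs."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:   # sample roughly one frame per second at 30 fps
            frames.append(frame)
        idx += 1
    cap.release()

    trajectory = []
    for before, after in zip(frames, frames[1:]):
        action = idm.predict(before, after)   # e.g. {"type": "click", "x": 412, "y": 87}
        if action is not None:                # skip pairs with no detectable UI change
            trajectory.append({"observation": before, "action": action})
    return trajectory
```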

These trajectories can be used to train models for computer-use tasks. But the researchers also found that the trajectories extracted by the IDM can serve as in-context learning (ICL) examples to improve the performance of CUAs on specialized tasks at inference time. For ICL, they use Gemini 2.5 Flash to add reasoning annotations to the observation/action pairs in each trajectory, which can then be inserted into a CUA agent’s prompt (typically 3-5 examples) at evaluation time.
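In rough pseudocode, the in-context-learning variant could be assembled like this. The annotation call and prompt format below are illustrative assumptions (the paper uses Gemini 2.5 Flash for the reasoning annotations, but its exact prompt structure is not described here).

```python
def annotate_with_reasoning(trajectory, annotator_llm):
    """Ask a multimodal LLM to add a short rationale to each observation/action
    pair, so the examples read as worked demonstrations rather than raw logs."""
    annotated = []
    for step in trajectory:
        rationale = annotator_llm.explain(step["observation"], step["action"])  # hypothetical call
        annotated.append({**step, "reasoning": rationale})
    return annotated

def build_cua_prompt(task, retrieved_trajectories, annotator_llm, k=4):
    # Insert a handful (typically 3-5) of annotated trajectories ahead of the new task.
    examples = [annotate_with_reasoning(t, annotator_llm) for t in retrieved_trajectories[:k]]
    prompt = ["You control a computer. Here are worked examples:"]
    for i, traj in enumerate(examples, 1):
        prompt.append(f"--- Example {i} ---")
        for step in traj:
            prompt.append(f"Reasoning: {step['reasoning']}\nAction: {step['action']}")
    prompt.append(f"--- New task ---\n{task}")
    return "\n".join(prompt)
```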

“This dual role (training and context guidance) enables flexible integration with both open-source models and general-purpose agents,” the researchers write.

W&L in action

To test the efficacy of W&L, the researchers ran a series of experiments with closed and open-source models on the OSWorld benchmark, which evaluates agents in real desktop and operating-system environments on a variety of tasks, including productivity, programming, and design.

For fine-tuning, they used their corpus of 53,000 trajectories to train two open-source models: UI-TARS-1.5, a strong open-source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM.

For the in-context learning tests, they applied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3, and Claude Sonnet 4.

The W&L trajectories improved performance on OSWorld across all model categories, including 3 points from ICL on general-purpose models and up to 11 points for the fine-tuned open-source models.

More importantly, these benefits were achieved without manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable basis for advancing CUAs toward real-world deployment.”

This could have important implications for real-world applications, enabling enterprises to convert their existing videos and conference recordings into training data for CUAs. It also makes it easier to generate new training trajectories: simply record videos of the tasks being performed and have the IDM annotate them. And with frontier models constantly improving and becoming cheaper, you can expect to extract more value from existing data as the field continues to advance.
