This new approach to training humanoids began in 2022 with the launch of ChatGPT. Large language models learned to generate text by being exposed to vast amounts of training data: every word ever written that AI companies could find (or, some argue, steal). Roboticists wanted to apply the same scaling laws to robotics, but they didn’t have an Internet-sized collection of data describing how we move.
Because such a dataset would be so difficult to assemble, companies turned to more practical alternatives, such as teaching robots to move in virtual simulations. However, simulations never fully model how things like friction or elasticity work in the real world, so robots trained in them (literally) stumble.
Now companies making humanoid robots have decided that collecting real-world data, cumbersome as it is, can pay off handsomely. This is where things get weird.
Early efforts were brilliant and scholarly. Labs collected hours and hours of data from people doing household chores, such as flipping waffles or cleaning their desks, while wearing cameras or using handheld grippers. The data was shared openly. But as venture capital money pours into robotics ($6.1 billion in 2025 for humanoids alone), the race to generate that training data has become more competitive and more widespread.