World Modeling Is Probably All You Need
Simulation is probably the most important thing our brains do.
Since the release of o1, there’s been a renaissance in RL for language models. Part of the rationale is that enough of the text projection of world knowledge is already captured in the language model.
Dario Amodei has mentioned that building GPT was necessary so that there was sufficient world knowledge to do RL on.
There have been fascinating results showing language models traversing the action space of language to earn sparse rewards, largely because language is a simulation that can be explored relatively cheaply. It can also be its own self-evaluator. That’s effectively what we do when we think: we use our own mind to plan and to evaluate, to the best of our knowledge (i.e. our world model).
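As a toy sketch of that propose-and-self-evaluate loop: the `sample_trace` and `self_score` helpers below are hypothetical stand-ins for calls to the same language model, not any real API.

```python
import random

def sample_trace(prompt: str) -> str:
    """Placeholder: sample one reasoning trace from the model."""
    return f"{prompt} -> plan #{random.randint(0, 999)}"

def self_score(prompt: str, trace: str) -> float:
    """Placeholder: ask the same model to judge its own trace."""
    return random.random()

def think(prompt: str, n_candidates: int = 8) -> str:
    # Explore the action space of language: sample several candidate plans,
    # then keep the one the model itself judges most promising.
    candidates = [sample_trace(prompt) for _ in range(n_candidates)]
    return max(candidates, key=lambda t: self_score(prompt, t))
```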
Ghosts of robot learning’s past
Robot learning has largely shifted away from RL. RL was deemed too hard to do efficiently, and starting in late 2022 the field moved toward behavior cloning. Since then, robot learning has been trying to build on top of the world knowledge of language models, and the biggest success has been coaxing semantic generalization out of vision-language models.
But I think this might be a misguided approach if you actually want to build embodied AGI. Robots, like humans, need to imagine what they are going to do; they should have mental simulations of the physical world. We probably don’t optimize on the real reward we receive, but on our expected reward. In effect, we plan based on expected value. That was the core thesis of Rich Sutton’s TD-learning: an inner critic helps us estimate our expected reward. And language models give us the generality to use them as evaluators for many different rewards.
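For reference, here is the textbook TD(0) critic update in miniature; the states and reward below are toy placeholders, a minimal sketch rather than anything specific to robots or language models.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # Nudge the critic's estimate toward the bootstrapped target
    # r + gamma * V(s'), i.e. plan on expected value rather than waiting
    # for the real final reward.
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = defaultdict(float)                                   # toy tabular critic
td0_update(V, s="reach_for_cup", r=0.0, s_next="holding_cup")
```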
Where are we in vision modeling?
We are probably still in the very early stages of visual world modeling. Recent results show that we are getting better at fitting visual representations, but we are still likely underfitting. In the same way that Dario said a language model first needed to be built before we could use RL on it, I think we’re still in that stage for visual modeling. Once we have a sufficient video model, we could probably do “visual thinking”: how can we iterate in pixel space to accomplish a complicated long-horizon end result? And once we figure that out, how do we turn that path of visual iteration into actions? When language models think, they still adhere to sensible plans because they know what a valid path of thinking looks like; they are surprisingly good at not reward hacking their way to the long-term reward. By the same analogy, the vision model should produce a realistically plausible path of visual thinking toward some desired end goal.
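One way that pixel-space planning loop might look, sketched with assumed `video_model`, `goal_scorer`, and `inverse_dynamics` interfaces (none of these refer to an existing library):

```python
def plan_in_pixels(video_model, goal_scorer, inverse_dynamics,
                   observation, goal_image, n_rollouts=16, horizon=64):
    # "Visual thinking": imagine several candidate futures in the video model.
    rollouts = [video_model.sample(observation, horizon=horizon)
                for _ in range(n_rollouts)]
    # Keep the imagined future that most plausibly reaches the goal.
    best = max(rollouts, key=lambda frames: goal_scorer(frames, goal_image))
    # Translate the chosen pixel-space plan into actions, frame by frame.
    return [inverse_dynamics(f_t, f_next)
            for f_t, f_next in zip(best[:-1], best[1:])]
```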
Simulate the world
The core issue in robot learning is that we don’t have scale. There has never been a robotics lab with more than 1,000 robots, and the largest robot manipulation fleets in research labs are probably around 200-300 robots.
Waymo achieved scale the old-fashioned way: they collected data manually for 15 years, from 2009 until their commercial breakthrough in 2024. If Waymo started today, it’s unclear whether they would do it the same way. They famously do large-scale resimulation of their logs, but it still took quite a bit of real data to get there. To their credit, though, they managed to scale a large fleet of mobile robots by motivating the deployment with a product use case.
But what if you don’t have a product to argue for a large fleet of robots? How can you make a diverse robot manipulation model? I’d argue the best case scenario would be to turn robotics into a software problem.
By collecting training data mostly inside a visual world model, and evaluating policies in that same world model, you could make robots smarter using only inference compute. The more they think about what to do, the more they can learn to do new tasks. Ultimately, you could even do the RL entirely inside the visual world model and learn by dreaming.
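A minimal sketch of what “learning by dreaming” could look like, assuming hypothetical `world_model`, `policy`, and `reward_fn` components; the actual policy-improvement step is left out.

```python
def imagine_rollout(world_model, policy, reward_fn, start_obs, horizon=15):
    """Roll the policy forward entirely inside the learned world model."""
    state = world_model.encode(start_obs)
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state)
        state = world_model.step(state, action)   # no real robot involved
        total_reward += reward_fn(state)          # reward judged in imagination
    # Feed this imagined return to whatever policy-gradient update you prefer.
    return total_reward
```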
This would also solve the cold-start problem of switching to new robot hardware. Many times in my career, I’ve seen people cling to old robots for too long because they don’t want to abandon the dataset collected on a previous robot. If data for a new robot could be cheaply regenerated, you could iterate on hardware much more quickly without the risk of your data becoming obsolete.
Beyond robotics
I’m biased when it comes to robotics, but even beyond robotics, I think world modeling might be the most important problem to work on for developing AGI, especially human-like AGI. A question I like to ask people is whether babies take their first step or say their first word earlier. It’s a bit of a trick question, because it’s about 50-50. We first learn visual world modeling, and then, in parallel, we attach semantics and actions to that world model.
Vision is the highest-bandwidth signal we take in as humans. At roughly 1.25 million bytes per second of visual input, we consume about 2.76 petabytes over a 70-year lifespan. Assuming a person reads for 1 hour per day and spends 4-5 hours per day listening to spoken audio, they consume about 36 MB/day of text information, or roughly 920 GB of words over a lifetime. That puts the visual-to-word ratio at roughly 3000-to-1.
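The back-of-the-envelope arithmetic behind those figures (the input rates themselves are the rough estimates stated above):

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

visual_bytes = 1.25e6 * SECONDS_PER_YEAR * 70   # ~2.76e15 bytes ≈ 2.76 PB
text_bytes = 36e6 * 365 * 70                    # ~9.2e11 bytes ≈ 920 GB
print(visual_bytes / text_bytes)                # ≈ 3000
```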
Text is more salient, because most of the visual information we take in doesn’t need to be retained; we don’t need to remember the fluttering of leaves in the background. By the time we’ve established language for a concept, it has become a fairly high-level abstraction that’s worth communicating. A dictionary is basically the ultimate set of basis vectors needed to describe our world, or at least the part of it worth communicating. That’s why language is so powerful: it has reduced the visual input into an extremely efficient representation to learn from.
Even in evolution, the first sensor was smell. We moved toward food when its smell grew stronger and away when it grew weaker: a simple one-dimensional signal, greedily optimized. Simple motion already existed in bilaterian lifeforms such as nematodes. Then we evolved eyes, and we were still generalist physical agents as we became more complex vertebrates. But even the lizard had no concept of language, despite being a generalist physical actor that could use its limbs in combinatorially complex ways. Only at the end of this evolutionary arc did the higher-level cognition emerge that language grew out of. Max Bennett’s A Brief History of Intelligence really shaped my evolutionary view of intelligence.
Overall, since the world is fundamentally physical, our next orders of magnitude of compute should go toward capturing the visual world. It’s likely the next frontier in building AI agents that develop more the way we did.