What world models get you is the move from the particular to the universal.
Let me explain.
Training a model on video means training it on the particular: each video has different cats, rooms, tables, etc. A world model is an abstraction, a generalization over those particulars.
LLMs are autoregressive: given this sequence of words, predict the next word (roughly). The objective: the best matching word. World models are trained with a very different objective: given this set of particular cats, mugs, chairs, learn the abstraction. The objective:
If you learn (memorize) a particular, you can reproduce the memory. But how do you communicate a universal? Symbolically: with words. Once you have words and memories, you have solved the symbol grounding problem: a symbol signifies the concept that was learned from the instances the model was trained on and memorized. The world model solves the frame problem.
All rooms are different, but the concept of a room is universal. When you walk into a room you have never seen before, you still recognize it as a room. That particular room is recognized zero-shot.
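
To make the particular-to-universal step concrete, here is a toy sketch. It is not a real world model and not how any specific system is trained. Everything in it is an assumption for illustration: the three-number "features", the Gaussian noise, and the choice of a mean vector as the "concept" (a nearest-prototype classifier). The helper names (learn_concepts, recognize, make_instances) are made up. The point is only that a word can be tied to something learned from many particulars, and that a new particular can then be recognized without ever having been seen.

```python
import math
import random

def mean_vector(vectors):
    """Average a list of equal-length feature vectors, column by column."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def learn_concepts(instances_by_word):
    """Map each word (symbol) to a prototype (concept) built from its particulars."""
    return {word: mean_vector(vs) for word, vs in instances_by_word.items()}

def recognize(concepts, instance):
    """Return the word whose prototype is closest to an unseen instance."""
    return min(concepts, key=lambda word: distance(concepts[word], instance))

def make_instances(center, n, rng, noise=0.3):
    """Generate n noisy particulars around a hypothetical 'true' center."""
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up features: (floor area, height, has walls), in arbitrary units.
particulars = {
    "room": make_instances([4.0, 2.5, 1.0], 20, rng),
    "mug": make_instances([0.1, 0.1, 0.0], 20, rng),
    "cat": make_instances([0.5, 0.3, 0.0], 20, rng),
}

concepts = learn_concepts(particulars)

# A room that is not among the training particulars.
new_room = [4.6, 2.2, 0.9]
print(recognize(concepts, new_room))  # prints "room"
```

The particulars are thrown away after learn_concepts runs; only one prototype per word is kept. That is the sense in which the word "room" points at a concept rather than at any one remembered room.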