GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Kaichen Zhou1,2,*, Yuzhen Chen1,*, Fangneng Zhan2, Hang Hua4, Grace Chen1, Xinhai Chang2, Ao Qu2, Yilun Du1, Zhuang Liu3, Paul Pu Liang2,†, Mengyu Wang1,†
1Harvard AI and Robotics Lab, Harvard University 2Media Lab and EECS, MIT 3Computer Science, Princeton University 4MIT-IBM Watson AI Lab
*Equal contribution as first authors. †Joint supervision.

Fig. 1: Overview. We propose a world-model-based visual planning system for robots. Taking the first observation as input, our world model GEM-4D predicts a future video of the robot, which the Adaptive Inverse Dynamic System then converts into a robot policy. Finally, this policy is executed in real-robot experiments.

Abstract

Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on video prediction and geometric consistency in both simulated and real-world scenarios, and improves real-world manipulation success from 61% to 81%.

Method

Method Overview. GEM-4D consists of two stages: geometry-consistent video world modeling and video-to-action extraction. First, we train a latent video DiT to generate future rollouts from an initial observation and a language instruction. Through Geometry-Enhanced Velocity Alignment, a frozen geometry foundation model and a parallel geometry DiT distill dense correspondence structure into the video backbone, encouraging rollouts to preserve depth, camera motion, and scene-flow consistency. At inference time, the geometry branch is discarded, so rollout generation remains as efficient as a single video DiT.
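As a rough illustration of how a training-only alignment term can sit next to the generative loss, here is a minimal NumPy sketch. It is not the authors' objective: the page does not state the loss, so the flow-matching target `x1 - x0`, the cosine form of the alignment term, the weight `lam`, and all function names are assumptions made for illustration. The point it shows is that the geometry target only enters the training loss, so removing the geometry branch at inference leaves the video model's forward pass unchanged.

```python
import numpy as np

def flow_matching_loss(v_pred, x0, x1):
    """MSE between a predicted velocity and the straight-line target x1 - x0.

    Assumption: x0 is the noise sample and x1 the clean latent.
    """
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

def cosine_alignment_loss(video_feat, geom_target, eps=1e-8):
    """1 - mean cosine similarity over tokens; both inputs are (tokens, dim).

    Hypothetical stand-in for aligning video-backbone features with a
    target produced by the geometry branch.
    """
    a = video_feat / (np.linalg.norm(video_feat, axis=-1, keepdims=True) + eps)
    b = geom_target / (np.linalg.norm(geom_target, axis=-1, keepdims=True) + eps)
    return float(1.0 - np.mean(np.sum(a * b, axis=-1)))

def training_loss(v_pred, x0, x1, video_feat, geom_target, lam=0.5):
    """Generative loss plus a weighted alignment term (lam is hypothetical)."""
    return (flow_matching_loss(v_pred, x0, x1)
            + lam * cosine_alignment_loss(video_feat, geom_target))
```

The alignment term is scale-invariant in this form, so it constrains the direction of the features rather than their magnitude; other distances (for example an MSE on velocities) would be equally plausible choices.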

Second, an Adaptive Inverse Dynamic System converts the generated rollout into executable robot arm actions. It grounds the target object and end-effector in 3D, tracks the end-effector motion, adaptively corrects tracking or pose failures, and converts the recovered trajectory into robot commands through inverse kinematics.
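The trajectory-recovery part of this stage can be sketched as follows, assuming the end-effector has already been tracked in 3D in the camera frame and that a camera-to-base transform is known. This is a simplified illustration rather than the system described above: the linear gap filling stands in for the adaptive correction step, the helper names and the 4x4 transform `T_base_cam` are hypothetical, and the final inverse-kinematics solve is left to the robot's own controller.

```python
import numpy as np

def fill_tracking_gaps(points, valid):
    """Linearly interpolate 3D positions on frames where tracking failed.

    points: (T, 3) positions, valid: (T,) boolean mask of trusted frames.
    Invalid frames before the first or after the last valid frame take the
    nearest valid value (np.interp clamps at the ends).
    """
    points = np.asarray(points, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    t = np.arange(len(points))
    out = points.copy()
    for axis in range(points.shape[1]):
        out[:, axis] = np.interp(t, t[valid], points[valid, axis])
    return out

def camera_to_base(points_cam, T_base_cam):
    """Map (T, 3) camera-frame points to the robot base frame."""
    homo = np.concatenate([points_cam, np.ones((len(points_cam), 1))], axis=1)
    return (homo @ T_base_cam.T)[:, :3]

def waypoint_deltas(points_base):
    """Per-step Cartesian displacements between consecutive waypoints."""
    return np.diff(points_base, axis=0)

# Example: the middle frame lost tracking and is recovered by interpolation.
track = np.array([[0.0, 0.0, 0.5], [np.nan, np.nan, np.nan], [0.2, 0.0, 0.5]])
valid = ~np.isnan(track).any(axis=1)
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.1, 0.0, 0.0]  # hypothetical camera offset in metres
waypoints = camera_to_base(fill_tracking_gaps(track, valid), T_base_cam)
deltas = waypoint_deltas(waypoints)
```

The resulting waypoints or deltas would then be passed to an inverse-kinematics solver to obtain joint commands.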

Fig. 2: Adaptive Inverse Dynamic System. Given a generated video as input, this system extracts a robot policy through the four steps illustrated in the figure.

Fig. 3: Real-robot rollouts. From left to right: ground-truth video, GEM-4D-generated RGB, and the back-projected 3D point cloud. The model produces realistic and geometrically coherent rollouts under unseen backgrounds, supporting transfer to UF ARM manipulation.
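For readers unfamiliar with back-projection, the point clouds in such a figure can be obtained from a frame's depth map with the standard pinhole camera model. The sketch below assumes a per-pixel depth map and a 3x3 intrinsics matrix `K` are available; how the page's depth is estimated is not stated, and the function name is hypothetical.

```python
import numpy as np

def depth_to_point_cloud(depth, K):
    """Back-project an (H, W) depth map to camera-frame 3D points.

    K is a 3x3 pinhole intrinsics matrix. Pixels with non-positive depth
    are treated as invalid and dropped. Returns an (N, 3) array.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row grids
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]
```

Applying this to every generated frame gives one point cloud per time step, which makes geometric drift across a rollout easy to inspect visually.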

Generated Video

Datasets: Droid, RLBench, Bridge, RT-1

Prompts:
Pick up the lion and put it into the black bowl
Close the middle drawer
Sweep up the nuts
Pick up the coke and place it on the table
Move the green cup
Close the lid of the orange jar
Fold the blue towel
Pick up the soda from the fridge

Real Robot

Each example pairs the generated video with the real-robot execution of the same prompt.

Prompt: Lift the numbered block with the number 3
Prompt: Pick up the orange cup
Prompt: Pick up the red pepper
Prompt: Put the trash into the trash bin

Conclusion

We present GEM-4D, a geometry-enhanced video world model for robot manipulation. GEM-4D distills geometric structure from foundation models during training, enabling future rollouts that are both visually realistic and correspondence-consistent, without adding inference cost. An adaptive inverse dynamics system further converts these rollouts into executable robot trajectories. Across real-world and simulated benchmarks, GEM-4D improves video prediction, geometric fidelity, and manipulation success, highlighting geometry-aware generation as a practical step toward reliable embodied world models.