GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Fig. 1: Overview. We propose a world-model-based visual planning system for robots. Taking the first observation as input, our world model GEM-4D predicts a robot manipulation video, from which the Adaptive Inverse Dynamic System extracts a robot policy. Finally, this policy is executed in real-robot experiments.

Abstract

Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible yet lack the physical grounding required for reliable action execution in tasks such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that addresses this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on video prediction and geometric consistency in both simulated and real-world scenarios, and improves real-world manipulation success from 61% to 81%.

Method

Method Overview. GEM-4D consists of two stages: geometry-consistent video world modeling and video-to-action extraction. First, we train a latent video DiT to generate future rollouts from an initial observation and a language instruction. Through Geometry-Enhanced Velocity Alignment, a frozen geometry foundation model and a parallel geometry DiT distill dense correspondence structure into the video backbone, encouraging rollouts to preserve depth, camera motion, and scene-flow consistency. At inference time, the geometry branch is discarded, so rollout generation remains as efficient as a single video DiT.
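The text above does not spell out the Geometry-Enhanced Velocity Alignment objective. As a rough illustration of the general idea only (a generation loss plus an auxiliary alignment term that exists only at training time), the sketch below uses plain Python lists in place of tensors. The function names, the cosine-distance alignment term, and the `align_weight` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def training_loss(video_pred, video_target, video_feat, geometry_feat,
                  align_weight=0.5):
    """Hypothetical training objective: generation term + alignment term.

    video_pred / video_target: velocity prediction and its target for the
        video backbone.
    video_feat: intermediate feature from the video backbone.
    geometry_feat: matching feature from the (frozen) geometry side, treated
        as a fixed target here.
    align_weight: assumed weighting between the two terms.
    """
    generation = mse(video_pred, video_target)
    alignment = cosine_distance(video_feat, geometry_feat)
    return generation + align_weight * alignment


def inference_loss(video_pred, video_target):
    """At inference the geometry branch is gone, so no geometry features
    are needed; only the video backbone output is evaluated."""
    return mse(video_pred, video_target)
```

The point of the design is visible in the signatures: `training_loss` needs geometry features, while `inference_loss` does not, which mirrors the claim that discarding the geometry branch leaves inference cost unchanged.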

Second, an Adaptive Inverse Dynamic System converts the generated rollout into executable robot arm actions. It grounds the target object and end-effector in 3D, tracks the end-effector motion, adaptively corrects tracking or pose failures, and converts the recovered trajectory into robot commands through inverse kinematics.
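To make the video-to-action step more concrete, here is a minimal sketch of three of its ingredients: lifting a tracked end-effector pixel to 3D with a pinhole camera model, filling in frames where tracking failed, and turning the recovered waypoints into per-step displacements that could be passed to an inverse kinematics solver. The helper names, the use of `None` to mark a tracking failure, and linear interpolation as the correction strategy are assumptions for illustration; the actual system's grounding, tracking, and correction components are not described at this level of detail.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth
    into camera-frame coordinates (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def fill_gaps(points):
    """Replace missing waypoints (None) by linear interpolation between the
    nearest valid neighbours; at the ends, hold the nearest valid point."""
    valid = [i for i, p in enumerate(points) if p is not None]
    if not valid:
        raise ValueError("no valid track points")
    filled = list(points)
    for i, p in enumerate(points):
        if p is not None:
            continue
        prev = max((j for j in valid if j < i), default=None)
        nxt = min((j for j in valid if j > i), default=None)
        if prev is None:
            filled[i] = points[nxt]
        elif nxt is None:
            filled[i] = points[prev]
        else:
            t = (i - prev) / (nxt - prev)
            filled[i] = tuple(a + t * (b - a)
                              for a, b in zip(points[prev], points[nxt]))
    return filled


def to_deltas(waypoints):
    """Per-step Cartesian displacements between consecutive waypoints."""
    return [tuple(b - a for a, b in zip(p, q))
            for p, q in zip(waypoints, waypoints[1:])]


# Usage: a three-frame track whose middle frame failed.
track = [(0.0, 0.0, 0.0), None, (2.0, 2.0, 2.0)]
waypoints = fill_gaps(track)
deltas = to_deltas(waypoints)
```

In a real pipeline the waypoints would also need to be transformed from the camera frame to the robot base frame before inverse kinematics; that calibration step is omitted here.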

Fig. 2: Adaptive Inverse Dynamic System. Given a generated video as input, this system extracts a robot policy through the four steps illustrated in the figure.

Fig. 3: Real-robot rollouts. From left: ground-truth video, GEM-4D-generated RGB, and the back-projected 3D point cloud. The model produces realistic and geometrically coherent rollouts under unseen backgrounds, supporting transfer to UF ARM manipulation.

Generated Video

Datasets: Droid, RLBench, Bridge, RT-1.

Prompts: "Pick up the lion and put it into the black bowl"; "Close the middle drawer"; "Sweep up the nuts"; "Pick up the coke and place it on the table"; "Move the green cup"; "Close the lid of the orange jar"; "Fold the blue towel"; "Pick up the soda from the fridge".

Real Robot

Each example shows the generated video alongside the real-robot execution for one prompt.

Prompts: "Lift the numbered block with the number 3"; "Pick up the orange cup"; "Pick up the red pepper"; "Put the trash into the trash bin".

Conclusion

We present GEM-4D, a geometry-enhanced video world model for robot manipulation. GEM-4D distills geometric structure from foundation models during training, enabling future rollouts that are both visually realistic and correspondence-consistent, without adding inference cost. An adaptive inverse dynamics system further converts these rollouts into executable robot trajectories. Across real-world and simulated benchmarks, GEM-4D improves video prediction, geometric fidelity, and manipulation success, highlighting geometry-aware generation as a practical step toward reliable embodied world models.