Multi-Modal Robotic World Modeling via Physically Consistent Video Generation and Cross-View Representation Alignment

Lars D. Welch; Sven Watkins; Tarun M. Raman; Massimo Wagner

Authors

Lars D. Welch, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA.
Sven Watkins, Department of Computer Science, Colorado State University, Fort Collins, CO, USA.
Tarun M. Raman, Department of Computer Science, University of North Texas, Denton, TX, USA.
Massimo Wagner, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA.

Keywords:

world modeling, multi-modal perception, physically consistent video generation, cross-view alignment, robotic autonomy, representation learning, infrastructure governance

Abstract

Building accurate and coherent world models is a fundamental challenge in autonomous robotics, particularly when agents must operate in unstructured, dynamic environments. This paper introduces a unified framework for multi-modal robotic world modeling that integrates physically consistent video generation with cross-view representation alignment. The proposed architecture relies on generative video models that respect physical constraints such as conservation of momentum and that handle occlusion and object permanence, so that predictions from sparse sensory inputs remain temporally coherent. In parallel, a cross-view representation alignment module maps observations from disparate sensor modalities, including RGB cameras, LiDAR, depth sensors, and radar, into a shared latent space that preserves spatial and temporal consistency. We analyze the structural trade-offs inherent in designing such a system: the balance between generative fidelity and computational efficiency, the governance of training data diversity, and the robustness of representations under distributional shift. We discuss deployment considerations for edge computing and cloud-in-the-loop architectures, alongside sustainability metrics related to energy consumption and model carbon footprint. We also examine fairness and policy implications arising from biased sensor configurations and uneven representation of environmental conditions. Synthesizing recent advances in video diffusion models, neural implicit representations, and contrastive learning, we propose a roadmap for scalable, physically grounded world modeling that can serve as a backbone for downstream planning, navigation, and manipulation tasks. This work contributes a systems-level perspective that bridges computer vision, robotics, and socio-technical infrastructure design.
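
As an illustration of the cross-view alignment idea, the sketch below computes a symmetric InfoNCE-style contrastive loss between paired embeddings from two sensor views. The abstract does not specify the alignment objective, so the choice of loss, the function name symmetric_info_nce, the temperature value, and the toy camera and LiDAR vectors are all assumptions made for illustration, not the authors' method.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def symmetric_info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE loss over paired embeddings from two views.

    view_a[i] and view_b[i] are assumed to describe the same scene at the
    same time step; every other pairing in the batch acts as a negative.
    """
    if len(view_a) != len(view_b) or not view_a:
        raise ValueError("views must be non-empty and the same length")
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    n = len(a)
    # Cosine similarity matrix, scaled by the temperature.
    sim = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
           for ai in a]
    # Match each view-A embedding to its view-B partner, and vice versa.
    loss_a_to_b = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    loss_b_to_a = sum(
        cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Hypothetical toy embeddings: three scenes observed by two sensors.
camera = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lidar_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Same LiDAR vectors, but paired with the wrong scenes.
lidar_shuffled = [lidar_aligned[1], lidar_aligned[2], lidar_aligned[0]]

aligned_loss = symmetric_info_nce(camera, lidar_aligned)
shuffled_loss = symmetric_info_nce(camera, lidar_shuffled)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The loss is low when embeddings of the same scene from different sensors lie close together in the shared space and high when they are mismatched, which is the property a shared latent space of this kind is meant to have. In practice the embeddings would come from learned per-modality encoders rather than hand-written vectors.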
