3D-VLA: A 3D Vision-Language-Action Generative World Model

Source: @_akhaliq

Recent vision-language-action (VLA) models rely on 2D inputs and lack integration with the broader 3D physical world. Moreover, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, humans are endowed with world models that depict imagination about future scenarios, allowing them to plan actions accordingly.
To this end, the paper proposes 3D-VLA, introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. To inject generation abilities into the model, a series of embodied diffusion models is trained and aligned into the LLM for predicting goal images and point clouds. To train 3D-VLA, a large-scale 3D embodied instruction dataset is curated by extracting vast 3D-related information from existing robotics datasets. Experiments on held-in datasets demonstrate that 3D-VLA significantly improves reasoning, multimodal generation, and planning capabilities in embodied environments.
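To make the described pipeline concrete, below is a minimal conceptual sketch of the control loop the abstract outlines: a 3D-aware LLM emits interaction tokens, an embodied diffusion model imagines the goal state (image / point cloud), and actions are planned toward that imagined goal. All class and function names here are hypothetical stand-ins, not the paper's actual API.

```python
# Conceptual sketch only: the generative-world-model loop described above.
# Every name below is a hypothetical placeholder, not the 3D-VLA codebase.
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: list          # placeholder for a 2D image
    point_cloud: list  # placeholder for 3D scene points

def llm_with_interaction_tokens(obs: Observation, instruction: str) -> str:
    """Stand-in for the 3D-based LLM: consumes the 3D scene plus the
    instruction and emits output containing interaction tokens,
    e.g. a <goal> token that triggers goal-state generation."""
    return f"<scene> {instruction} <goal>"

def embodied_diffusion(obs: Observation, llm_output: str) -> Observation:
    """Stand-in for the embodied diffusion models: generates the imagined
    goal image / goal point cloud conditioned on the LLM output."""
    return Observation(rgb=obs.rgb, point_cloud=obs.point_cloud)

def plan_actions(current: Observation, goal: Observation) -> list:
    """Stand-in planner: derives actions that move the current 3D state
    toward the generated goal state."""
    return ["move_arm", "grasp", "place"]

if __name__ == "__main__":
    obs = Observation(rgb=[0.0], point_cloud=[(0.0, 0.0, 0.0)])
    out = llm_with_interaction_tokens(obs, "put the apple in the bowl")
    goal = embodied_diffusion(obs, out)   # imagine the future scenario
    print(plan_actions(obs, goal))        # plan actions toward that goal
```

The key design point this sketch tries to capture is that action prediction is mediated by an explicit imagined future state, rather than a direct perception-to-action mapping.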

https://arxiv.org/abs/2403.09631

