02 Jul
arXiv:2407.00114v1 Announce Type: new Abstract: We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented…
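To make the behavior-tokenization idea concrete, the sketch below shows one plausible reading of the abstract: an encoder compresses a trajectory $\{o_0, a_0, \dots\}$ into a handful of discrete tokens via a vector-quantized codebook, and an imitation-learning policy decoder predicts actions conditioned on those tokens. This is an illustrative approximation, not the paper's implementation; all module names, dimensions, the GRU encoder, and the VQ-VAE-style quantizer are assumptions.

```python
# Hedged sketch of a self-supervised behavior tokenizer (assumed design, not
# OmniJARVIS's actual architecture): encode a trajectory into discrete tokens,
# then decode actions from observations + tokens with an IL objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BehaviorTokenizer(nn.Module):
    def __init__(self, obs_dim=512, act_dim=32, hidden=256,
                 codebook_size=1024, n_tokens=4):
        super().__init__()
        self.n_tokens = n_tokens
        # Encoder over the (observation, action) sequence.
        self.encoder = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.to_slots = nn.Linear(hidden, n_tokens * hidden)
        # Discrete "behavior vocabulary" (codebook).
        self.codebook = nn.Embedding(codebook_size, hidden)
        # IL policy decoder: predicts a_t from o_t and the behavior tokens.
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + n_tokens * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def quantize(self, z):
        # Nearest-neighbor lookup in the codebook -> discrete token indices.
        d = (z.unsqueeze(2) - self.codebook.weight.view(1, 1, -1, z.size(-1))).pow(2).sum(-1)
        idx = d.argmin(-1)                      # (B, n_tokens) behavior tokens
        return self.codebook(idx), idx

    def forward(self, obs, act):
        # obs: (B, T, obs_dim), act: (B, T, act_dim)
        h, _ = self.encoder(torch.cat([obs, act], -1))
        z = self.to_slots(h[:, -1]).view(obs.size(0), self.n_tokens, -1)
        zq, idx = self.quantize(z)
        # VQ codebook + commitment losses, then straight-through estimator.
        vq_loss = F.mse_loss(zq, z.detach()) + 0.25 * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()
        cond = zq.flatten(1).unsqueeze(1).expand(-1, obs.size(1), -1)
        act_pred = self.decoder(torch.cat([obs, cond], -1))
        recon = F.mse_loss(act_pred, act)       # imitation / reconstruction loss
        return recon + vq_loss, idx


# Toy usage: the discrete indices in `tokens` are the kind of behavior tokens
# that could be appended to a VLM vocabulary so the model reasons in language
# and "acts" by emitting tokens.
model = BehaviorTokenizer()
obs, act = torch.randn(2, 16, 512), torch.randn(2, 16, 32)
loss, tokens = model(obs, act)
loss.backward()
```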