same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
To process this multi-modal data from different tasks and modalities, it is serialized into a flat sequence of tokens.
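The serialization step can be sketched as follows. This is a minimal illustration, not the paper's actual tokenizer: the vocabulary sizes, bin count, and value range below are hypothetical, and real systems use subword tokenizers for text and learned or mu-law encodings for continuous values. The key idea shown is that continuous observations and actions are discretized into integer bins offset past the text vocabulary, so every modality ends up in one shared flat token sequence.

```python
import numpy as np

# Hypothetical vocabulary layout (illustrative constants only):
TEXT_VOCAB = 32000   # subword text tokens occupy ids [0, TEXT_VOCAB)
NUM_BINS = 1024      # discretization bins for continuous values

def tokenize_continuous(values, lo=-1.0, hi=1.0):
    """Discretize continuous values (e.g. joint torques) into
    NUM_BINS uniform bins, offset past the text vocabulary so the
    two token ranges never collide."""
    clipped = np.clip(values, lo, hi)
    bins = ((clipped - lo) / (hi - lo) * (NUM_BINS - 1)).astype(np.int64)
    return bins + TEXT_VOCAB

def serialize_episode(text_tokens, observations, actions):
    """Flatten one episode into a single token sequence:
    text first, then interleaved per-step observation and
    action tokens."""
    seq = list(text_tokens)
    for obs, act in zip(observations, actions):
        seq.extend(tokenize_continuous(obs))
        seq.extend(tokenize_continuous(act))
    return np.array(seq, dtype=np.int64)

tokens = serialize_episode(
    text_tokens=[17, 42],                  # e.g. a task instruction
    observations=[np.array([0.1, -0.5])],  # e.g. proprioception
    actions=[np.array([0.3])],             # e.g. one joint torque
)
```

Because the resulting sequence is just integers, the same transformer can consume it regardless of which modalities produced each span.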
In this representation, Gato can be trained and sampled from much like a standard large-scale language model.
Masking is used so that the loss function is applied only to target outputs, i.e., text and various actions.
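The masking idea can be sketched as a masked cross-entropy over the flat sequence. This is a generic sketch of the technique, not the paper's implementation; the shapes and mask values are illustrative. Positions holding observation tokens get mask 0, so gradients flow only through positions the model is asked to predict.

```python
import numpy as np

def masked_nll(logits, targets, target_mask):
    """Cross-entropy over a flat token sequence, where target_mask
    is 1 only at positions the model should predict (text and
    action tokens) and 0 elsewhere (e.g. observation tokens)."""
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    # average the loss only over masked-in (target) positions
    return (nll * target_mask).sum() / target_mask.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))       # 4 positions, vocabulary of 8
targets = np.array([1, 5, 2, 7])
mask = np.array([0.0, 1.0, 1.0, 0.0])  # predict only positions 1 and 2
loss = masked_nll(logits, targets, mask)
```

In a framework like JAX or PyTorch the same effect is usually achieved by multiplying the per-token loss by the mask before reduction, exactly as above.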
During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context.
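A minimal sketch of the inverse mapping at deployment time, assuming a hypothetical vocabulary layout in which text ids occupy the low range and discretized continuous values sit above it (the constants below are illustrative, not the paper's): a sampled token is routed by its id range, either passed to a text detokenizer or mapped back to the centre of its discretization bin.

```python
# Hypothetical vocabulary layout (illustrative constants only):
TEXT_VOCAB = 32000   # text token ids occupy [0, TEXT_VOCAB)
NUM_BINS = 1024      # discretization bins for continuous values

def decode_token(token, lo=-1.0, hi=1.0):
    """Route a sampled token by its id range: text ids pass
    through to a text detokenizer; higher ids are mapped back to
    the centre of their bin, yielding a continuous action value."""
    if token < TEXT_VOCAB:
        return ("text", token)
    b = token - TEXT_VOCAB
    value = lo + (b + 0.5) * (hi - lo) / NUM_BINS
    return ("action", value)

decode_token(42)     # routed as a text token
decode_token(32512)  # bin 512, near the middle of [-1, 1]
```

The same routing generalizes to button presses or other discrete action spaces by reserving further id ranges for them.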
Transformer sequence models are effective as multi-task, multi-embodiment policies, including for real-world text, vision, and robotics tasks.