Gato
- A Generalist Agent
- Gato
- a single generalist agent that goes beyond the realm of text outputs, inspired by progress in large-scale language modeling
- multi-modal, multi-task, multi-embodiment generalist policy
- the same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens
- To enable processing, the multi-modal data from the different tasks and modalities is serialized into a flat sequence of tokens (see the serialization sketch after this list)
- In this representation, Gato can be trained and sampled from much like a standard large-scale language model
- Masking is used so that the loss function is applied only to target outputs, i.e., text and various actions (see the masked-loss sketch below)
- During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context (see the decoding sketch below)
- Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks
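A minimal sketch of what the serialization into a flat token sequence might look like. The μ-law companding, the 1024 bins for continuous values, and the 32000-word text vocabulary follow the paper; the function names and exact vocabulary layout here are illustrative assumptions, and images are handled separately as patch embeddings rather than discrete tokens.

```python
import numpy as np

# Hypothetical vocabulary layout loosely following the paper:
# [0, 32000)     -> SentencePiece text subwords and small discrete env values
# [32000, 33024) -> 1024 uniform bins for mu-law-encoded continuous values
TEXT_VOCAB = 32000
NUM_BINS = 1024

def mu_law(x, mu=100.0, m=256.0):
    """Companding transform that squashes continuous values towards [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log(m * mu + 1.0)

def tokenize_continuous(values):
    """Map continuous values (e.g. joint torques) to token ids."""
    squashed = np.clip(mu_law(np.asarray(values, dtype=np.float64)), -1.0, 1.0)
    bins = np.floor((squashed + 1.0) / 2.0 * (NUM_BINS - 1)).astype(int)
    return (TEXT_VOCAB + bins).tolist()

def tokenize_discrete(values):
    """Discrete values such as Atari button presses stay as small integers."""
    return [int(v) for v in values]

# One training example becomes a single flat sequence, e.g.
# [image patch embeddings ...] + proprioception tokens + action tokens
proprio_tokens = tokenize_continuous([0.13, -0.92, 0.4])
action_tokens = tokenize_discrete([3])   # e.g. one button press
sequence = proprio_tokens + action_tokens
print(sequence)
```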
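A sketch of the masked objective: next-token cross-entropy is computed at every position, then multiplied by a per-token mask that is 1 only where the target is text or an action, so observation tokens contribute nothing to the loss. Shapes and names are illustrative.

```python
import numpy as np

def masked_nll(logits, targets, loss_mask):
    """Next-token cross-entropy applied only where loss_mask == 1.

    logits:    (T, V) unnormalised scores per position
    targets:   (T,)   ground-truth token ids
    loss_mask: (T,)   1 for target outputs (text, actions), 0 for observations
    """
    # log-softmax for numerical stability
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return (token_nll * loss_mask).sum() / np.maximum(loss_mask.sum(), 1)

# Example: 4 positions, only the last two (action tokens) count as targets
rng = np.random.default_rng(0)
loss = masked_nll(rng.normal(size=(4, 33024)),
                  np.array([5, 7, 32010, 3]),
                  np.array([0, 0, 1, 1]))
```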
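For deployment, a hypothetical inverse of the serialization sketch above: sampled ids that fall in the continuous-value range are mapped back through the binning and μ-law encoding to recover, for example, joint torques, while ids below the text vocabulary boundary would instead go through the text detokenizer or be read directly as button presses, depending on the context.

```python
import numpy as np

TEXT_VOCAB = 32000   # same hypothetical layout as the serialization sketch
NUM_BINS = 1024

def inverse_mu_law(y, mu=100.0, m=256.0):
    """Invert the companding transform back to the original value range."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log(m * mu + 1.0)) / mu

def detokenize_continuous(token_ids):
    """Map sampled token ids back to continuous values such as joint torques."""
    bins = np.asarray(token_ids) - TEXT_VOCAB
    squashed = bins / (NUM_BINS - 1) * 2.0 - 1.0
    return inverse_mu_law(squashed)

# e.g. three sampled ids from the continuous-value range decode back to torques
torques = detokenize_continuous([32500, 32010, 33000])
```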