object detection as a language modeling task conditioned on the observed pixel inputs
Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence