Pix2Seq
- Pix2seq: a Language Modeling Framework for Object Detection
- generic framework for object detection
- object detection as a language modeling task conditioned on the observed pixel inputs
- Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence
- COCO
- output can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering)
- Autoregressive
- stop inference when the ending token is produced
- applying it to offline inference, or online scenarios where the objects of interest are relatively sparse
- entirely based on human annotation