Whisper

  • Audio-to-Text converter
  • Multilingual speech recognition, speech translation, and language identification
  • Goal: a speech recognition system should work reliably out of the box across a broad range of environments, without supervised fine-tuning of a decoder for every deployment distribution.
  • Motivation: earlier approaches pre-train the audio encoder without supervision but lack a high-quality pre-trained decoder, so they still need a fine-tuning stage.
  • Trained on 680,000 hours of labeled audio data
  • Audio is broken into 30-second segments, each paired with the subset of the transcript that occurs within that time segment.
  • Encoder-decoder transformer
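
The 30-second segmentation noted above can be sketched roughly as follows. This is a minimal illustration, not the actual Whisper data pipeline: the `Word` type, the `segment_transcript` helper, and the assumption that word-level timestamps are available are all hypothetical, and a word is assigned to a segment by its start time purely for simplicity.

```python
from dataclasses import dataclass

SEGMENT_SECONDS = 30.0  # segment length stated in the notes


@dataclass
class Word:
    """Hypothetical transcript unit with timestamps in seconds."""
    text: str
    start: float
    end: float


def segment_transcript(words, total_duration, segment_seconds=SEGMENT_SECONDS):
    """Split a recording into fixed-length windows and pair each window
    with the transcript words that start inside it.

    Returns a list of (window_start, window_end, text) tuples.
    Windows containing no words get an empty string.
    """
    segments = []
    index = 0
    while index * segment_seconds < total_duration:
        start = index * segment_seconds
        end = min(start + segment_seconds, total_duration)
        # Assumption: a word belongs to the window in which it starts.
        text = " ".join(w.text for w in words if start <= w.start < end)
        segments.append((start, end, text))
        index += 1
    return segments


if __name__ == "__main__":
    demo_words = [
        Word("hello", 1.0, 1.5),
        Word("world", 29.8, 30.4),
        Word("again", 31.0, 31.5),
    ]
    for seg in segment_transcript(demo_words, total_duration=65.0):
        print(seg)
```

A 65-second recording yields three windows here, the last one shorter than 30 seconds and with no words in it.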