Listen Attend Spell
- Listen, Attend and Spell
- LAS
- learns to transcribe speech utterances to characters
- unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly
- sequence-to-sequence framework
- trained end-to-end and has two main components: a listener (encoder) and a speller (decoder)
- listener is a pyramidal RNN encoder that accepts filter bank spectra as inputs, transforms the input sequence into a high level feature representation and reduces the number of timesteps that the decoder has to attend to.
- the speller is an attention-based RNN decoder that attends to the high level features and spells out the transcript one character at a time
- The proposed system does not use the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs
- bypasses the conditional independence assumptions of CTC; the model learns an implicit language model that can generate multiple spelling variants given the same acoustics
- producing character sequences without making any independence assumptions between the characters is the key improvement of LAS over previous end-to-end CTC models
- used samples from the Softmax classifier in the decoder as inputs to the next step prediction during training
- shows how a language model trained on additional text can be used to rerank the top hypotheses
- evaluated on a Google voice search task
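The listener's timestep reduction described above can be sketched as follows. This is a minimal NumPy illustration of the pyramid's shape arithmetic only, not the paper's BLSTM implementation: each pyramid layer concatenates pairs of consecutive frames from the layer below, halving the sequence length, so stacking three such layers gives an 8x reduction in the number of timesteps the speller must attend to.

```python
import numpy as np

def pyramid_reduce(frames):
    """Concatenate consecutive frame pairs: (T, d) -> (T // 2, 2 * d).

    In a pyramidal encoder each layer consumes such concatenated
    outputs of the layer below, so three layers reduce length by 8x.
    """
    T, d = frames.shape
    T = T - (T % 2)                      # drop a trailing odd frame
    return frames[:T].reshape(T // 2, 2 * d)

# 40-dim filter bank features over 8 timesteps (shapes are illustrative)
x = np.random.randn(8, 40)
h = pyramid_reduce(pyramid_reduce(pyramid_reduce(x)))
print(h.shape)  # (1, 320): 8 timesteps reduced 8x
```

In the actual model an RNN layer runs over each reduced sequence; only the pair-concatenation trick is shown here.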
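One attention step of the speller can be sketched like this: score every high-level listener feature against the current decoder state, normalize the scores into weights, and return the weighted (expected) context feature. Dot-product scoring here stands in for the paper's learned energy function, and all shapes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, listener_features):
    """One content-based attention step.

    listener_features: (U, d) high-level features from the listener
    decoder_state:     (d,)   current speller RNN state
    Returns the context vector and the attention weights.
    """
    scores = listener_features @ decoder_state   # (U,) similarity scores
    alpha = softmax(scores)                      # attention distribution
    return alpha @ listener_features, alpha      # expected feature

U, d = 5, 16                  # U attended timesteps, feature dim d
h = np.random.randn(U, d)
s = np.random.randn(d)
context, alpha = attend(s, h)
print(context.shape, alpha.sum())
```

The context vector is then fed, together with the previous character, into the speller RNN that emits the next character's distribution.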
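The training trick of feeding samples from the decoder's softmax back as the next-step input (instead of always teacher-forcing the ground truth) can be sketched as below; the 10% sampling rate and the helper name are illustrative assumptions, not the paper's exact configuration.

```python
import random

def next_decoder_input(ground_truth, model_sample, sample_rate=0.1, rng=random):
    """With probability sample_rate, feed the character the model itself
    sampled from its softmax back as the next decoder input; otherwise
    teacher-force the ground-truth character. sample_rate is illustrative.
    """
    if rng.random() < sample_rate:
        return model_sample
    return ground_truth

rng = random.Random(0)
inputs = [next_decoder_input('a', 'b', sample_rate=0.1, rng=rng)
          for _ in range(1000)]
print(inputs.count('b'))  # roughly 100 of 1000 draws use the model's sample
```

This reduces the mismatch between training (ground-truth history) and inference (the model conditions on its own predictions).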
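Reranking the top hypotheses with a language model trained on additional text follows the general recipe sketched below: length-normalize the decoder's log-probability and add a weighted LM log-probability. The weight `lam` and the exact combination are assumptions for illustration, not the paper's tuned values.

```python
import math

def rescore(hypotheses, lm_logprob, lam=0.5):
    """Rerank beam-search hypotheses with an external language model.

    Each hypothesis is a (text, decoder_logprob) pair. Length
    normalization keeps the combined score from favoring short strings.
    """
    def combined(h):
        text, decoder_lp = h
        return decoder_lp / max(len(text), 1) + lam * lm_logprob(text)
    return sorted(hypotheses, key=combined, reverse=True)

# toy LM that prefers hypotheses containing a space (word-like strings)
lm = lambda t: math.log(0.9) if ' ' in t else math.log(0.1)
hyps = [('callmenow', -4.0), ('call me now', -4.5)]
print(rescore(hyps, lm)[0][0])  # 'call me now' wins after rescoring
```

The external LM thus corrects spelling variants the acoustics alone cannot disambiguate.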