Listen, Attend and Spell

  • Listen, Attend and Spell
  • LAS
  • learns to transcribe speech utterances to characters
  • unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly
  • sequence-to-sequence framework
  • trained end-to-end and has two main components: a listener (encoder) and a speller (decoder)
  • the listener is a pyramidal RNN encoder that accepts filter-bank spectra as inputs, transforms the input sequence into a high-level feature representation, and reduces the number of timesteps the decoder has to attend to
  • the speller is an attention-based RNN decoder that attends to the high-level features and spells out the transcript one character at a time
  • The proposed system does not use the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs
  • bypasses the conditional-independence assumptions of CTC; the model learns an implicit language model that can generate multiple spelling variants given the same acoustics
  • producing character sequences without making any independence assumptions between the characters is the key improvement of LAS over previous end-to-end CTC models
  • during training, samples from the softmax classifier in the decoder are used as inputs to the next-step prediction (instead of always feeding the ground-truth character)
  • a language model trained on additional text can be used to rerank the top hypotheses
  • evaluated on a Google voice search task
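The listener's timestep reduction can be sketched as follows: each pyramidal layer concatenates adjacent frames before its recurrence, so stacking several such layers shrinks the sequence the decoder must attend to. This is a minimal NumPy sketch of the concatenation step only (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def pyramid_reduce(frames):
    """Concatenate adjacent timesteps, halving the sequence length.

    frames: array of shape (T, d); returns shape (T // 2, 2 * d).
    In a pyramidal encoder this reduction is applied before each
    layer's recurrence, so three stacked layers cut the number of
    timesteps by a factor of 8.
    """
    T, d = frames.shape
    T = T - (T % 2)                     # drop a trailing odd frame
    return frames[:T].reshape(T // 2, 2 * d)

# 40 filter-bank frames of dimension 8 -> 20 timesteps of dimension 16
x = np.random.randn(40, 8)
y = pyramid_reduce(x)
print(y.shape)  # (20, 16)
```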
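The training trick of feeding sampled characters back into the decoder can be sketched as a per-step coin flip between the ground-truth character and the character sampled from the softmax. `next_input` and `sample_rate` are hypothetical names; this is a simplification of the scheme, not the paper's exact implementation:

```python
import random

def next_input(step, ground_truth, sampled_char, sample_rate=0.1):
    """Choose the decoder input for the next prediction step.

    With probability sample_rate, feed the character sampled from the
    decoder's own softmax instead of the ground-truth character. This
    reduces the mismatch between training (correct history) and
    inference (the decoder conditions on its own past outputs).
    """
    if random.random() < sample_rate:
        return sampled_char
    return ground_truth[step]
```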
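The reranking step can be sketched as combining a length-normalized decoder score with a weighted language-model score. The names `rescore`, `lm_logprob`, and the weight `lam` are illustrative, and in practice the weight is tuned on a held-out set:

```python
def rescore(hypotheses, lm_logprob, lam=0.5):
    """Rerank beam-search hypotheses using an external language model.

    hypotheses: list of (text, decoder_logprob) pairs.
    lm_logprob:  callable returning log P_LM(text) (assumed interface).
    Score each hypothesis as decoder log-prob normalized by character
    count, plus a weighted LM log-prob, and sort best-first.
    """
    def score(item):
        text, dec_lp = item
        return dec_lp / max(len(text), 1) + lam * lm_logprob(text)
    return sorted(hypotheses, key=score, reverse=True)

# toy LM that strongly prefers the conventional spelling
toy_lm = lambda t: -1.0 if t == "cat" else -5.0
ranked = rescore([("cat", -3.0), ("kat", -2.9)], toy_lm)
```

Without the LM term, "kat" would win on decoder score alone; the LM term flips the ranking toward the likelier spelling, which is how the rescoring resolves the multiple spelling variants the decoder produces.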