The authors present a direct speech-to-speech translation (S2ST) model that translates speech in one language into speech in another language without relying on intermediate text generation.
The approach applies a self-supervised discrete speech encoder to the target speech, then trains a sequence-to-sequence speech-to-unit translation model to predict the discrete representations of the target speech.
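The idea of turning target speech into discrete prediction targets can be illustrated with a minimal sketch. The notes above do not say how the encoder produces its discrete representations, so this example assumes one common scheme: each frame-level feature vector is mapped to the index of its nearest centroid in a fixed codebook. The function name `quantize_frames`, the toy features, and the codebook are all hypothetical and are not taken from the source.

```python
import numpy as np


def quantize_frames(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame (row of `features`) to the index of its nearest centroid.

    Assumption: nearest-centroid lookup stands in for the discrete speech
    encoder; the source does not specify the actual quantization method.
    """
    # Pairwise differences, shape (num_frames, num_centroids, feature_dim).
    diffs = features[:, None, :] - centroids[None, :, :]
    # Squared Euclidean distance from every frame to every centroid.
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest centroid for each frame.
    return np.argmin(dists, axis=1)


# Hypothetical toy codebook with three units and 2-dimensional features.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

# Hypothetical frame-level features for a short target utterance.
features = np.array(
    [[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [5.2, 4.9], [0.0, 0.2]]
)

# Discrete unit sequence that a speech-to-unit model would be trained to predict.
units = quantize_frames(features, centroids)
print(units.tolist())
```

Under this assumption, the resulting integer sequence plays the role that text tokens play in a conventional translation model: it is the target side of the sequence-to-sequence training pairs.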
When target text transcripts are available, they design a joint speech and text training framework that enables the model to generate dual-modality output (speech and text) simultaneously in the same inference pass.