dGSLM

Generative Spoken Dialogue Language Modeling (dGSLM)
The first “textless” model able to generate audio samples of naturalistic spoken dialogues
Combines unsupervised spoken unit discovery with a dual-tower Transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (the Fisher dataset) without any text or labels
Generates speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn-taking
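
A minimal sketch of the dual-tower idea, to make the "two channels simultaneously" point concrete. Each speaker channel is a sequence of discrete-unit embeddings; a layer applies causal self-attention within a channel, then causal cross-attention to the other channel. Everything here is an assumption for illustration: the names `causal_attention` and `DualTowerLayer`, the single attention head, the random weights, the absence of layer norm and feed-forward blocks, and the choice to share weights between the two towers. It is not the dGSLM implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head attention; position t attends only to positions <= t of kv_in."""
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class DualTowerLayer:
    """Hypothetical layer: both towers reuse the same weights (an assumption)."""
    def __init__(self, d, rng):
        self.self_w = [rng.normal(0, d ** -0.5, (d, d)) for _ in range(3)]
        self.cross_w = [rng.normal(0, d ** -0.5, (d, d)) for _ in range(3)]

    def __call__(self, a, b):
        # Self-attention inside each channel (residual connection).
        a1 = a + causal_attention(a, a, *self.self_w)
        b1 = b + causal_attention(b, b, *self.self_w)
        # Cross-attention: each channel queries the other channel's past.
        a_out = a1 + causal_attention(a1, b1, *self.cross_w)
        b_out = b1 + causal_attention(b1, a1, *self.cross_w)
        return a_out, b_out

rng = np.random.default_rng(0)
layer = DualTowerLayer(8, rng)
chan_a = rng.normal(size=(5, 8))  # 5 time steps of unit embeddings, speaker A
chan_b = rng.normal(size=(5, 8))  # same time steps, speaker B
out_a, out_b = layer(chan_a, chan_b)
```

Because each channel sees the other's past at every step, a model built this way can in principle learn overlaps, backchannels, and turn transitions without an explicit turn-segmentation step.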

Subhaditya's KB