Textless Speech Emotion Conversion
- Textless Speech Emotion Conversion Using Discrete and Decomposed Representations
- Speech emotion conversion
- modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity
- emotion conversion is cast as a spoken language translation task
- decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion
- first modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units
- speech waveform is generated by feeding the predicted representations into a neural vocoder
- beyond spectral and parametric changes of the signal
- model non-verbal vocalizations, such as laughter insertion, yawning removal, etc.
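
The three stages noted above (translate the content units, predict prosody from the translated units, then vocode) can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: `DecomposedSpeech`, `convert_emotion`, the `toy_*` functions, the `LAUGH` unit id, and the emotion labels are all hypothetical names, and the toy functions are hand-written placeholders standing in for learned models.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class DecomposedSpeech:
    """Hypothetical container for the four components listed in the notes."""
    content_units: List[int]  # discrete phonetic-content units
    prosody: List[float]      # one placeholder prosodic value per unit
    speaker: str              # speaker identity, preserved by conversion
    emotion: str              # emotion label, replaced by the target


def convert_emotion(
    utt: DecomposedSpeech,
    target_emotion: str,
    translate_units: Callable[[Sequence[int], str, str], List[int]],
    predict_prosody: Callable[[Sequence[int], str], List[float]],
    vocode: Callable[[DecomposedSpeech], List[float]],
) -> Tuple[DecomposedSpeech, List[float]]:
    # Stage 1: translate the content units toward the target emotion.
    new_units = translate_units(utt.content_units, utt.emotion, target_emotion)
    # Stage 2: predict prosodic features from the translated units.
    new_prosody = predict_prosody(new_units, target_emotion)
    # Stage 3: generate a waveform from the predicted representation.
    # The speaker field is copied unchanged.
    converted = replace(
        utt,
        content_units=new_units,
        prosody=new_prosody,
        emotion=target_emotion,
    )
    return converted, vocode(converted)


# --- Toy stand-ins (assumptions, not learned models) ---

LAUGH = 999  # hypothetical unit id standing in for a laughter unit


def toy_translate(units: Sequence[int], src: str, tgt: str) -> List[int]:
    # Mimics "laughter insertion": append a laughter unit for one target label.
    out = list(units)
    if tgt == "amused":
        out.append(LAUGH)
    return out


def toy_prosody(units: Sequence[int], emotion: str) -> List[float]:
    # One constant placeholder value per unit, chosen by emotion label.
    scale = 1.2 if emotion == "amused" else 1.0
    return [scale for _ in units]


def toy_vocode(rep: DecomposedSpeech) -> List[float]:
    # Placeholder "waveform": one silent sample per unit.
    return [0.0] * len(rep.content_units)


if __name__ == "__main__":
    source = DecomposedSpeech([3, 7, 7, 12], [1.0] * 4, "spk1", "neutral")
    converted, wav = convert_emotion(
        source, "amused", toy_translate, toy_prosody, toy_vocode
    )
    print(converted.content_units, converted.speaker, len(wav))
```

Because the unit sequence itself is rewritten in stage 1, its length can change, which is what lets this kind of pipeline insert or remove non-verbal vocalizations rather than only reshaping the signal.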