modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity
spoken language translation task
decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion
modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units
speech waveform is generated by feeding the predicted representations into a neural vocoder
beyond spectral and parametric changes of the signal
model non-verbal vocalizations, such as laughter insertion and yawning removal
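
The decompose, translate, predict-prosody flow listed above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `DecomposedSpeech` container, the `convert_emotion` function, the unit IDs `LAUGH` and `YAWN`, and both toy models are hypothetical stand-ins for learned components. The neural vocoder step is omitted; in a real system the returned representation would be passed to it to produce the waveform.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class DecomposedSpeech:
    """Hypothetical container for the four streams named in the text."""
    content_units: List[int]  # discrete phonetic-content unit IDs
    prosody: List[float]      # one prosodic value per unit (an assumption)
    speaker: int              # speaker ID, kept fixed during conversion
    emotion: str              # emotion label


def convert_emotion(
    src: DecomposedSpeech,
    target_emotion: str,
    translate_units: Callable[[Sequence[int], str, str], List[int]],
    predict_prosody: Callable[[Sequence[int], str], List[float]],
) -> DecomposedSpeech:
    """Translate content units to the target emotion, then predict prosody."""
    new_units = translate_units(src.content_units, src.emotion, target_emotion)
    new_prosody = predict_prosody(new_units, target_emotion)
    # Speaker identity is carried over unchanged.
    return DecomposedSpeech(new_units, new_prosody, src.speaker, target_emotion)


# Toy stand-ins for the learned models (hypothetical unit IDs).
LAUGH, YAWN = 100, 101


def toy_translate(units: Sequence[int], src_emotion: str, tgt_emotion: str) -> List[int]:
    """Drop yawning units; prepend a laughter unit when the target is 'amused'."""
    kept = [u for u in units if u != YAWN]
    return [LAUGH] + kept if tgt_emotion == "amused" else kept


def toy_prosody(units: Sequence[int], tgt_emotion: str) -> List[float]:
    """Emit one placeholder prosodic value per translated unit."""
    value = 1.0 if tgt_emotion == "amused" else 0.0
    return [value for _ in units]


src = DecomposedSpeech([5, 7, YAWN, 7], [0.0] * 4, speaker=3, emotion="sleepy")
out = convert_emotion(src, "amused", toy_translate, toy_prosody)
print(out.content_units, out.speaker, out.emotion)
```

Because the translation acts on the unit sequence itself, it can insert or remove units (the laughter and yawning cases here), which is what lets the approach go beyond spectral and parametric changes of the signal.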