Textless Speech Emotion Conversion

  • Textless Speech Emotion Conversion Using Discrete and Decomposed Representations
  • Speech emotion conversion
  • modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity
  • emotion conversion is cast as a spoken language translation task
  • decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion
  • modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units
  • speech waveform is generated by feeding the predicted representations into a neural vocoder
  • beyond spectral and parametric changes of the signal
  • model non-verbal vocalizations, such as laughter insertion and yawning removal
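
The pipeline in the notes above (decompose, translate the content units, predict prosody, vocode) can be sketched as a simple data flow. This is only a toy illustration, not the paper's implementation: every name here (`SpeechRepresentation`, `convert_emotion`, the `toy_*` functions, `LAUGHTER_UNIT`) is hypothetical, the three stages are hand-written placeholders standing in for learned models, and prosody is reduced to one number per unit.

```python
from dataclasses import dataclass
from typing import List, Tuple

LAUGHTER_UNIT = 999  # hypothetical unit id standing in for a laughter token


@dataclass(frozen=True)
class SpeechRepresentation:
    """Decomposed utterance: content units, prosody, speaker, emotion."""
    content_units: List[int]  # discrete phonetic-content units
    prosody: List[float]      # toy prosodic feature, one value per unit
    speaker: str
    emotion: str


def toy_translate_units(units: List[int], source: str, target: str) -> List[int]:
    """Placeholder for the learned unit-to-unit translation model.

    Appends a laughter unit when converting to "amused", to mimic the
    insertion of a non-verbal vocalization.
    """
    if target == "amused" and source != "amused":
        return list(units) + [LAUGHTER_UNIT]
    return list(units)


def toy_predict_prosody(units: List[int], speaker: str, target: str) -> List[float]:
    """Placeholder prosody predictor conditioned on the translated units."""
    level = 1.5 if target == "amused" else 1.0
    return [level for _ in units]


def toy_vocoder(rep: SpeechRepresentation, samples_per_unit: int = 4) -> List[float]:
    """Placeholder vocoder: emits a fixed number of samples per unit."""
    waveform: List[float] = []
    for value in rep.prosody:
        waveform.extend([value] * samples_per_unit)
    return waveform


def convert_emotion(
    rep: SpeechRepresentation, target: str
) -> Tuple[SpeechRepresentation, List[float]]:
    """Translate units first, then predict prosody from them, then vocode."""
    units = toy_translate_units(rep.content_units, rep.emotion, target)
    prosody = toy_predict_prosody(units, rep.speaker, target)
    # The speaker is carried over unchanged; only the emotion label changes.
    converted = SpeechRepresentation(units, prosody, rep.speaker, target)
    return converted, toy_vocoder(converted)


source = SpeechRepresentation([12, 7, 7, 31], [1.0] * 4, "spk_01", "neutral")
converted, waveform = convert_emotion(source, "amused")
```

The ordering is the point of the sketch: prosody is predicted from the translated units rather than copied from the source, so a change in the unit sequence (here, an added laughter unit) also changes the length of the prosody track and of the generated waveform.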