self-supervised discrete representations for the task of speech resynthesis
separately extract low-bitrate representations for speech content, prosodic information, and speaker identity
This allows speech to be synthesized in a controllable manner
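To make "low-bitrate" concrete, the bitrate of a stream of discrete codes can be bounded by the frame rate times the number of bits needed to index the codebook. The sketch below is illustrative only: the function name `stream_bitrate`, the frame rates, and the codebook sizes are assumptions chosen for the example, not values taken from the text. The speaker representation is left out because the text does not say how it is encoded.

```python
import math


def stream_bitrate(frames_per_second: float, codebook_size: int) -> float:
    """Bits per second for one stream of discrete codes.

    Assumes one code per frame and a fixed-length code, i.e.
    log2(codebook_size) bits per frame. Entropy coding could go lower.
    """
    if frames_per_second < 0:
        raise ValueError("frames_per_second must be non-negative")
    if codebook_size < 1:
        raise ValueError("codebook_size must be at least 1")
    return frames_per_second * math.log2(codebook_size)


# Hypothetical settings, used only to illustrate the arithmetic.
content_bps = stream_bitrate(frames_per_second=50, codebook_size=100)
prosody_bps = stream_bitrate(frames_per_second=12.5, codebook_size=32)
total_bps = content_bps + prosody_bps

print(f"content: {content_bps:.2f} bps")
print(f"prosody: {prosody_bps:.2f} bps")
print(f"total:   {total_bps:.2f} bps")
```

Under these assumed settings the two streams together stay well below one kilobit per second, which is the sense in which separately coded content and prosody streams can be called low-bitrate.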
evaluate F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), the intelligibility of the recordings, and overall quality using subjective human evaluation
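The text does not state which measure is used for F0 reconstruction. As one plausible sketch, the example below compares a reference and a resynthesized F0 contour frame by frame, reporting the fraction of frames whose voiced/unvoiced decision differs and the RMSE in Hz over frames voiced in both. The function name, the convention that unvoiced frames are marked with 0, and the sample contours are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def f0_reconstruction_metrics(
    f0_ref: Sequence[float], f0_syn: Sequence[float]
) -> Dict[str, float]:
    """Compare two frame-aligned F0 contours (Hz, 0 = unvoiced by assumption)."""
    if len(f0_ref) != len(f0_syn):
        raise ValueError("contours must have the same number of frames")
    if len(f0_ref) == 0:
        raise ValueError("contours must not be empty")

    voicing_mismatches = 0
    squared_errors = []
    for ref, syn in zip(f0_ref, f0_syn):
        ref_voiced = ref > 0
        syn_voiced = syn > 0
        if ref_voiced != syn_voiced:
            voicing_mismatches += 1
        elif ref_voiced:
            # Pitch error is only meaningful where both frames are voiced.
            squared_errors.append((ref - syn) ** 2)

    vde = voicing_mismatches / len(f0_ref)
    if squared_errors:
        rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    else:
        rmse = float("nan")
    return {"voicing_decision_error": vde, "voiced_rmse_hz": rmse}


# Made-up contours, five frames each.
reference = [0.0, 100.0, 110.0, 120.0, 0.0]
resynthesized = [0.0, 104.0, 107.0, 0.0, 90.0]
metrics = f0_reconstruction_metrics(reference, resynthesized)
print(metrics)
```

Separating the voicing decision from the pitch error avoids penalizing a frame twice and keeps the RMSE from being dominated by frames where one contour is simply unvoiced.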