X-Vectors
- X-Vectors: Robust DNN Embeddings for Speaker Recognition
- uses data augmentation to improve the performance of deep neural network (DNN) embeddings for speaker recognition
- the DNN is trained to discriminate between speakers and maps variable-length utterances to fixed-dimensional embeddings called x-vectors
- prior studies have found that embeddings leverage large-scale training datasets better than i-vectors, but it can be challenging to collect substantial quantities of labeled data for training
- use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness
- Their data augmentation strategy employs additive noise and reverberation
- Reverberation involves convolving room impulse responses (RIRs) with audio
- they use the simulated RIRs described by Ko et al.
- reverberation itself is performed with the multicondition training tools in the Kaldi ASpIRE recipe
- For additive noise, they use the MUSAN dataset
- a PLDA classifier is used in the x-vector framework to make the final decision, as in i-vector systems
- x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese, where they achieve superior performance
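
The two augmentation steps in these notes (convolving audio with an RIR, and adding noise) can be sketched in a few lines of NumPy. This is only an illustrative stand-in, not the Kaldi tooling the authors used: the function names `reverberate` and `add_noise`, the peak rescaling, the noise tiling, and the 10 dB SNR in the demo are all assumptions made for the example, and the signals are synthetic rather than real RIRs or MUSAN recordings.

```python
import numpy as np


def reverberate(speech, rir):
    """Convolve speech with a room impulse response (hypothetical helper).

    The output is trimmed to the input length and rescaled so that its
    peak amplitude matches the clean signal (an assumption of this sketch).
    """
    wet = np.convolve(speech, rir, mode="full")[: len(speech)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a target SNR in dB (hypothetical helper)."""
    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        return speech.copy()

    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr_db/10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


# Demo on synthetic signals; real use would load speech, an RIR, and noise.
rng = np.random.default_rng(0)
demo_speech = rng.standard_normal(16000)
demo_rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)
demo_noise = rng.standard_normal(4000)

demo_reverbed = reverberate(demo_speech, demo_rir)
demo_noisy = add_noise(demo_speech, demo_noise, snr_db=10.0)
```

Each augmented copy keeps the speaker label of the clean utterance, which is how augmentation multiplies the amount of labeled training data without new recordings.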