Odyssey 2018 - How to train your speaker embeddings extractor June 29, 2018
Mitchell Mclaren, Diego Castán, Mahesh Kumar Nandwana, Luciana Ferrer and Emre Yilmaz
With the recent introduction of speaker embeddings for text-independent speaker recognition, many fundamental questions require addressing in order to fast-track the development of this new era of technology. Of particular interest is the ability of the speaker embeddings network to leverage artificially degraded data far more effectively than prior technologies, even when evaluated on naturally degraded data. In this study, we explore some of the fundamental requirements for building a good speaker embeddings extractor. We analyze the impact of voice activity detection, types of degradation, the amount of degraded data, and the number of speakers required for a good network. These aspects are analyzed over a large set of 11 conditions from 7 evaluation datasets. We lay out a set of recommendations for training the network based on the observed trends. By applying these recommendations to enhance the default recipe provided in the Kaldi toolkit, a significant gain of 13-21% on the Speakers in the Wild and NIST SRE’16 datasets is achieved.
Cite as: Mclaren, M., Castán, D., Nandwana, M.K., Ferrer, L., Yilmaz, E. (2018) How to train your speaker embeddings extractor. Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 327-334, DOI: 10.21437/Odyssey.2018-46.