Overcoming Data Scarcity in Speaker Identification: Dataset Augmentation with Synthetic MFCCs via Character-level RNN

Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Aniko Ekart, Pedro P. S. Ayrosa

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Autonomous speaker identification suffers from data scarcity, since it is unrealistic to gather hours of audio from a speaker to form a dataset. This inevitably leads to class imbalance, because large-scale speech datasets of non-speakers are readily available online. In this study, we explore the possibility of improving speaker recognition by augmenting the dataset with synthetic data produced by training a Character-level Recurrent Neural Network on a short clip of five spoken sentences. A deep neural network is trained on a selection of the Flickr8k dataset together with the real and synthetic speaker data (all in the form of MFCCs) as a binary classification problem, in order to discern the speaker from the Flickr speakers. With the number of synthetic data objects ranging from 2,500 to 10,000, the network weights are then transferred to the original dataset of only Flickr8k and the real speaker data, in order to discern whether useful rules can be learnt from the synthetic data. Results for all three subjects show that fine-tuning from datasets augmented with synthetic speech improves classification accuracy, F1 score, precision, and recall when applied to the scarce real data versus non-speaker data. We conclude that even with just five short spoken sentences, data augmentation via synthetic speech data generated by a Char-RNN can improve the speaker classification process. Accuracy and related metrics are shown to improve from around 93% to 99% for three subjects classified from thousands of others when fine-tuning from exposure to 2,500-10,000 synthetic data points. High F1 scores, precision, and recall also show that the issues due to class imbalance are solved.
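
The abstract reports precision, recall, and F1 alongside accuracy because accuracy alone can look strong on an imbalanced speaker vs non-speaker split. The sketch below illustrates that point in plain Python. The function name `binary_metrics`, the label convention (1 = target speaker, 0 = non-speaker), and the 50-vs-950 example counts are all assumptions made for illustration; none of them come from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set (illustrative counts, not the paper's data):
# 50 target-speaker samples followed by 950 non-speaker samples.
y_true = [1] * 50 + [0] * 950

# A degenerate classifier that never predicts the target speaker.
always_negative = [0] * 1000

# A classifier that finds 45 of the 50 speaker samples and raises 5 false alarms.
balanced_guess = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 945

for name, y_pred in [("always negative", always_negative),
                     ("balanced guess", balanced_guess)]:
    print(name, binary_metrics(y_true, y_pred))
```

In this toy setup the degenerate classifier still reaches 95% accuracy while its precision, recall, and F1 are all zero, which is why high values on all four metrics together are more persuasive evidence than accuracy alone.
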
Original language: English
Title of host publication: 2020 IEEE International Conference on Autonomous Robot Systems and Competitions, ICARSC 2020
Editors: Nuno Lau, Manuel F. Silva, Luis Paulo Reis, Jose Cascalho
Publisher: IEEE
Pages: 146-151
Number of pages: 6
ISBN (Electronic): 978-1-7281-7078-7
ISBN (Print): 978-1-7281-7079-4
DOIs: https://doi.org/10.1109/ICARSC49921.2020.9096166
Publication status: Published - 19 May 2020
Event: 2020 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC) - Ponta Delgada, Portugal
Duration: 15 Apr 2020 to 17 Apr 2020

Conference

Conference: 2020 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC)
Period: 15/04/20 to 17/04/20

Keywords

  • Autonomous Systems
  • Data Augmentation
  • Generative Models
  • Human-robot Interaction
  • Speaker Identification
  • Speech Recognition

  • Cite this

    Bird, J. J., Faria, D. R., Premebida, C., Ekart, A., & Ayrosa, P. P. S. (2020). Overcoming Data Scarcity in Speaker Identification: Dataset Augmentation with Synthetic MFCCs via Character-level RNN. In N. Lau, M. F. Silva, L. P. Reis, & J. Cascalho (Eds.), 2020 IEEE International Conference on Autonomous Robot Systems and Competitions, ICARSC 2020 (pp. 146-151). [9096166] IEEE. https://doi.org/10.1109/ICARSC49921.2020.9096166