Phoneme Aware Speech Synthesis via Fine Tune Transfer Learning with a Tacotron Spectrogram Prediction Network

Jordan J. Bird, Anikó Ekárt, Diego R. Faria

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The implications of realistic human speech imitation are both promising and potentially dangerous. In this work, a pre-trained Tacotron Spectrogram Feature Prediction Network is fine-tuned with two 1.6 h speech datasets for 100,000 learning iterations, producing two individual models. The two speech datasets are identical in content other than their textual representation: one follows standard written English, whereas the second is an English phonetic representation, in order to study the effect of the representation on the learning process. To test imitative abilities post-training, thirty lines of speech are recorded from a human to be imitated. The models then attempt to produce these voice lines themselves, and the acoustic fingerprints of the outputs are compared to the real human speech. On average, English notation achieves 27.36% similarity to the human speech, whereas Phonetic English notation achieves 35.31%. This suggests that representing English through the International Phonetic Alphabet provides more useful training data than written English. These experiments thus suggest that a phonetic-aware paradigm would improve the abilities of speech synthesis, similarly to its effects in the field of speech recognition.
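
As a quick check on the figures quoted in the abstract, the gap between the two reported average similarity scores can be worked out directly. The sketch below is only arithmetic on those two numbers. The function and variable names are illustrative assumptions, not taken from the paper, and the code says nothing about how the acoustic fingerprint similarity itself was measured.

```python
def compare_scores(baseline, candidate):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = candidate - baseline
    relative = absolute / baseline * 100.0
    return absolute, relative


# Average similarity scores as reported in the abstract (percent).
english_avg = 27.36   # model fine-tuned on written English text
phonetic_avg = 35.31  # model fine-tuned on phonetic English text

abs_gain, rel_gain = compare_scores(english_avg, phonetic_avg)
print(f"{abs_gain:.2f} percentage points, {rel_gain:.1f}% relative")
```

Under these assumptions the phonetic model's average is 7.95 percentage points higher, roughly a 29.1% relative increase over the written-English model.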
Original language: English
Title of host publication: Advances in Computational Intelligence Systems - Contributions Presented at the 19th UK Workshop on Computational Intelligence, 2019
Editors: Zhaojie Ju, Dalin Zhou, Alexander Gegov, Longzhi Yang, Chenguang Yang
Publisher: Springer
Chapter: 23
Pages: 271-282
Number of pages: 12
Volume: 1043
ISBN (Electronic): 978-3-030-29933-0
ISBN (Print): 978-3-030-29932-3
DOI: https://doi.org/10.1007/978-3-030-29933-0_23
Publication status: Published - 30 Aug 2019
Event: 19th UK Workshop on Computational Intelligence: UKCI 2019 - Portsmouth, United Kingdom
Duration: 4 Sep 2019 to 6 Sep 2019

Publication series

Name: Advances in Intelligent Systems and Computing
Volume: 1043
ISSN (Print): 2194-5357
ISSN (Electronic): 2194-5365

Conference

Conference: 19th UK Workshop on Computational Intelligence
Country: United Kingdom
City: Portsmouth
Period: 4/09/19 to 6/09/19

Fingerprint

Speech synthesis
Speech analysis
Speech recognition
Acoustics
Experiments

Keywords

  • Fine tune learning
  • Fingerprint analysis
  • Phonetic awareness
  • Speech synthesis
  • Tacotron

Cite this

Bird, J. J., Ekárt, A., & Faria, D. R. (2019). Phoneme Aware Speech Synthesis via Fine Tune Transfer Learning with a Tacotron Spectrogram Prediction Network. In Z. Ju, D. Zhou, A. Gegov, L. Yang, & C. Yang (Eds.), Advances in Computational Intelligence Systems - Contributions Presented at the 19th UK Workshop on Computational Intelligence, 2019 (Vol. 1043, pp. 271-282). (Advances in Intelligent Systems and Computing; Vol. 1043). Springer. https://doi.org/10.1007/978-3-030-29933-0_23
@inproceedings{e2e8cb15571b4dd8adc9b9081e8c75dc,
title = "Phoneme Aware Speech Synthesis via Fine Tune Transfer Learning with a Tacotron Spectrogram Prediction Network",
keywords = "Fine tune learning, Fingerprint analysis, Phonetic awareness, Speech synthesis, Tacotron",
author = "Bird, {Jordan J.} and Anik{\'o} Ek{\'a}rt and Faria, {Diego R.}",
year = "2019",
month = "8",
day = "30",
doi = "10.1007/978-3-030-29933-0_23",
language = "English",
isbn = "978-3-030-29932-3",
volume = "1043",
series = "Advances in Intelligent Systems and Computing",
publisher = "Springer",
pages = "271--282",
editor = "Zhaojie Ju and Dalin Zhou and Alexander Gegov and Longzhi Yang and Chenguang Yang",
booktitle = "Advances in Computational Intelligence Systems - Contributions Presented at the 19th UK Workshop on Computational Intelligence, 2019",
address = "Germany",

}

TY - GEN

T1 - Phoneme Aware Speech Synthesis via Fine Tune Transfer Learning with a Tacotron Spectrogram Prediction Network

AU - Bird, Jordan J.

AU - Ekárt, Anikó

AU - Faria, Diego R.

PY - 2019/8/30

Y1 - 2019/8/30

KW - Fine tune learning

KW - Fingerprint analysis

KW - Phonetic awareness

KW - Speech synthesis

KW - Tacotron

UR - http://link.springer.com/10.1007/978-3-030-29933-0_23

UR - http://www.scopus.com/inward/record.url?scp=85072853384&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-29933-0_23

DO - 10.1007/978-3-030-29933-0_23

M3 - Conference contribution

SN - 978-3-030-29932-3

VL - 1043

T3 - Advances in Intelligent Systems and Computing

SP - 271

EP - 282

BT - Advances in Computational Intelligence Systems - Contributions Presented at the 19th UK Workshop on Computational Intelligence, 2019

A2 - Ju, Zhaojie

A2 - Zhou, Dalin

A2 - Gegov, Alexander

A2 - Yang, Longzhi

A2 - Yang, Chenguang

PB - Springer

ER -
