A modular holistic approach to prosody modelling for Standard Yorùbá speech synthesis

Odétúnjí Odéjobí, S.H. Sylvia Wong, Anthony J. Beaumont

Research output: Contribution to journal, Article

Abstract

This paper presents a novel prosody model in the context of computer text-to-speech synthesis applications for tone languages. We have demonstrated its applicability using the Standard Yorùbá (SY) language. Our approach is motivated by the theory that abstract and realised forms of various prosody dimensions should be modelled within a modular and unified framework [Coleman, J.S., 1994. Polysyllabic words in the YorkTalk synthesis system. In: Keating, P.A. (Ed.), Phonological Structure and Forms: Papers in Laboratory Phonology III, Cambridge University Press, Cambridge, pp. 293–324]. We have implemented this framework using the Relational Tree (R-Tree) technique. R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree. The underlying assumption of this research is that it is possible to develop a practical prosody model by using appropriate computational tools and techniques which combine acoustic data with an encoding of the phonological and phonetic knowledge provided by experts. To implement the intonation dimension, fuzzy logic based rules were developed using speech data from native speakers of Yorùbá. The Fuzzy Decision Tree (FDT) and the Classification and Regression Tree (CART) techniques were tested in modelling the duration dimension. For practical reasons, we have selected the FDT for implementing the duration dimension of our prosody model. To establish the effectiveness of our prosody model, we have also developed a Stem-ML prosody model for SY. We have performed both quantitative and qualitative evaluations on our implemented prosody models. The results suggest that, although the R-Tree model does not predict the numerical speech prosody data as accurately as the Stem-ML model, it produces synthetic speech prosody with better intelligibility and naturalness. The R-Tree model is particularly suitable for speech prosody modelling for languages with limited language resources and expertise, e.g. African languages. Furthermore, the R-Tree model is easy to implement, interpret and analyse.
Original language: English
Pages (from-to): 39-68
Number of pages: 30
Journal: Computer Speech and Language
Volume: 22
Issue number: 1
DOI: 10.1016/j.csl.2007.05.002
Publication status: Published - Jan 2008

Fingerprint

Prosody
Speech Synthesis
holistic approach
Language
Decision Trees
Modeling
Fuzzy Decision Tree
Fuzzy Logic
Phonetics
Model
Population Groups
Acoustics
Standards
Classification and Regression Trees
Speech intelligibility
Text-to-speech
standard language

Bibliographical note

NOTICE: this is the author’s version of a work that was accepted for publication in Computer Speech and Language. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Ọdẹ́jọbí, Ọdẹ́túnjí A.; Wong, S.H. Sylvia and Beaumont, Anthony J. (2008). A modular holistic approach to prosody modelling for Standard Yorùbá speech synthesis. Computer Speech and Language, 22 (1), pp. 39-68. DOI 10.1016/j.csl.2007.05.002

Keywords

  • speech synthesis
  • prosody modelling
  • Standard Yorùbá
  • tone languages
  • modular holistic model
  • relational trees

Cite this

@article{9717ddd9675b4ecca357d208def48a50,
title = "A modular holistic approach to prosody modelling for Standard Yor{\`u}b{\'a} speech synthesis",
abstract = "This paper presents a novel prosody model in the context of computer text-to-speech synthesis applications for tone languages. We have demonstrated its applicability using the Standard Yor{\`u}b{\'a} (SY) language. Our approach is motivated by the theory that abstract and realised forms of various prosody dimensions should be modelled within a modular and unified framework [Coleman, J.S., 1994. Polysyllabic words in the YorkTalk synthesis system. In: Keating, P.A. (Ed.), Phonological Structure and Forms: Papers in Laboratory Phonology III, Cambridge University Press, Cambridge, pp. 293–324]. We have implemented this framework using the Relational Tree (R-Tree) technique. R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree. The underlying assumption of this research is that it is possible to develop a practical prosody model by using appropriate computational tools and techniques which combine acoustic data with an encoding of the phonological and phonetic knowledge provided by experts. To implement the intonation dimension, fuzzy logic based rules were developed using speech data from native speakers of Yor{\`u}b{\'a}. The Fuzzy Decision Tree (FDT) and the Classification and Regression Tree (CART) techniques were tested in modelling the duration dimension. For practical reasons, we have selected the FDT for implementing the duration dimension of our prosody model. To establish the effectiveness of our prosody model, we have also developed a Stem-ML prosody model for SY. We have performed both quantitative and qualitative evaluations on our implemented prosody models. The results suggest that, although the R-Tree model does not predict the numerical speech prosody data as accurately as the Stem-ML model, it produces synthetic speech prosody with better intelligibility and naturalness. 
The R-Tree model is particularly suitable for speech prosody modelling for languages with limited language resources and expertise, e.g. African languages. Furthermore, the R-Tree model is easy to implement, interpret and analyse.",
keywords = "speech synthesis, prosody modelling, Standard Yor{\`u}b{\'a}, tone languages, modular holistic model, relational trees",
author = "Od{\'e}job{\'i}, Od{\'e}t{\'u}nj{\'i} and Wong, {S.H. Sylvia} and Beaumont, {Anthony J.}",
note = "NOTICE: this is the author’s version of a work that was accepted for publication in Computer Speech and Language. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in {\d O}d{\'{\d e}}j{\d o}b{\'i}, {\d O}d{\'{\d e}}t{\'u}nj{\'i} A.; Wong, S.H. Sylvia and Beaumont, Anthony J. (2008). A modular holistic approach to prosody modelling for Standard Yor{\`u}b{\'a} speech synthesis. Computer Speech and Language, 22 (1), pp. 39-68. DOI 10.1016/j.csl.2007.05.002",
year = "2008",
month = "1",
doi = "10.1016/j.csl.2007.05.002",
language = "English",
volume = "22",
pages = "39--68",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",
number = "1",

}

A modular holistic approach to prosody modelling for Standard Yorùbá speech synthesis. / Odéjobí, Odétúnjí; Wong, S.H. Sylvia; Beaumont, Anthony J.

In: Computer Speech and Language, Vol. 22, No. 1, 01.2008, p. 39-68.


TY - JOUR

T1 - A modular holistic approach to prosody modelling for Standard Yorùbá speech synthesis

AU - Odéjobí, Odétúnjí

AU - Wong, S.H. Sylvia

AU - Beaumont, Anthony J.

N1 - NOTICE: this is the author’s version of a work that was accepted for publication in Computer Speech and Language. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Ọdẹ́jọbí, Ọdẹ́túnjí A.; Wong, S.H. Sylvia and Beaumont, Anthony J. (2008). A modular holistic approach to prosody modelling for Standard Yorùbá speech synthesis. Computer Speech and Language, 22 (1), pp. 39-68. DOI 10.1016/j.csl.2007.05.002

PY - 2008/1

Y1 - 2008/1

AB - This paper presents a novel prosody model in the context of computer text-to-speech synthesis applications for tone languages. We have demonstrated its applicability using the Standard Yorùbá (SY) language. Our approach is motivated by the theory that abstract and realised forms of various prosody dimensions should be modelled within a modular and unified framework [Coleman, J.S., 1994. Polysyllabic words in the YorkTalk synthesis system. In: Keating, P.A. (Ed.), Phonological Structure and Forms: Papers in Laboratory Phonology III, Cambridge University Press, Cambridge, pp. 293–324]. We have implemented this framework using the Relational Tree (R-Tree) technique. R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree. The underlying assumption of this research is that it is possible to develop a practical prosody model by using appropriate computational tools and techniques which combine acoustic data with an encoding of the phonological and phonetic knowledge provided by experts. To implement the intonation dimension, fuzzy logic based rules were developed using speech data from native speakers of Yorùbá. The Fuzzy Decision Tree (FDT) and the Classification and Regression Tree (CART) techniques were tested in modelling the duration dimension. For practical reasons, we have selected the FDT for implementing the duration dimension of our prosody model. To establish the effectiveness of our prosody model, we have also developed a Stem-ML prosody model for SY. We have performed both quantitative and qualitative evaluations on our implemented prosody models. The results suggest that, although the R-Tree model does not predict the numerical speech prosody data as accurately as the Stem-ML model, it produces synthetic speech prosody with better intelligibility and naturalness. The R-Tree model is particularly suitable for speech prosody modelling for languages with limited language resources and expertise, e.g. African languages. Furthermore, the R-Tree model is easy to implement, interpret and analyse.

KW - speech synthesis

KW - prosody modelling

KW - Standard Yorùbá

KW - tone languages

KW - modular holistic model

KW - relational trees

UR - http://www.scopus.com/inward/record.url?scp=34548295430&partnerID=8YFLogxK

U2 - 10.1016/j.csl.2007.05.002

DO - 10.1016/j.csl.2007.05.002

M3 - Article

VL - 22

SP - 39

EP - 68

JO - Computer Speech and Language

JF - Computer Speech and Language

SN - 0885-2308

IS - 1

ER -