A fuzzy decision tree-based duration model for Standard Yorùbá text-to-speech synthesis

Ọdẹ́túnjí A. Odéjọbí, Shun Ha Sylvia Wong, Anthony J. Beaumont

Research output: Contribution to journal › Article

Abstract

In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yorùbá (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation, and intensity, using different techniques and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model. We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data since it achieved a better accuracy for the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximation. We, therefore, conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications.
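
The two quantitative measures named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes that "Corr" denotes the Pearson correlation coefficient, and the function names and the syllable durations (in milliseconds) are hypothetical values invented for the example, not results from the paper.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed durations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_corr(predicted, observed):
    """Pearson correlation coefficient (assumed meaning of 'Corr')."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


# Hypothetical syllable durations in milliseconds (illustrative only).
observed_ms = [180.0, 220.0, 150.0, 300.0]
predicted_ms = [170.0, 230.0, 160.0, 290.0]

print("RMSE (ms):", rmse(predicted_ms, observed_ms))
print("Corr:", pearson_corr(predicted_ms, observed_ms))
```

Computing both measures separately on the training and test sets is what allows the comparison described in the abstract: a model can fit the training data more closely yet generalise less well to held-out data.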
Original language: English
Pages (from-to): 325-349
Number of pages: 25
Journal: Computer Speech and Language
Volume: 21
Issue number: 2
Early online date: 10 Jun 2006
DOI: 10.1016/j.csl.2006.06.005
Publication status: Published - Apr 2007

Keywords

  • decision theory
  • fuzzy sets
  • least squares approximations
  • mathematical models
  • measurement errors
  • regression analysis
  • speech synthesis

Cite this

@article{9dbd6e694148405b9971a6cd6d9ba8f0,
title = "A fuzzy decision tree-based duration model for Standard Yor{\`u}b{\'a} text-to-speech synthesis",
abstract = "In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yor{\`u}b{\'a} (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation, and intensity, using different techniques and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model. We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data since it achieved a better accuracy for the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximation. We, therefore, conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications.",
keywords = "decision theory, fuzzy sets, least squares approximations, mathematical models, measurement errors, regression analysis, speech synthesis",
author = "Od{\'e}jọb{\'i}, {Ọdẹ́t{\'u}nj{\'i} A.} and Wong, {Shun Ha Sylvia} and Beaumont, {Anthony J.}",
note = "Copyright 2008 Elsevier B.V., All rights reserved.",
year = "2007",
month = "4",
doi = "10.1016/j.csl.2006.06.005",
language = "English",
volume = "21",
pages = "325--349",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",
number = "2",

}

A fuzzy decision tree-based duration model for Standard Yorùbá text-to-speech synthesis. / Odéjọbí, Ọdẹ́túnjí A.; Wong, Shun Ha Sylvia; Beaumont, Anthony J.

In: Computer Speech and Language, Vol. 21, No. 2, 04.2007, p. 325-349.

TY  - JOUR
T1  - A fuzzy decision tree-based duration model for Standard Yorùbá text-to-speech synthesis
AU  - Odéjọbí, Ọdẹ́túnjí A.
AU  - Wong, Shun Ha Sylvia
AU  - Beaumont, Anthony J.
N1  - Copyright 2008 Elsevier B.V., All rights reserved.
PY  - 2007/4
Y1  - 2007/4
AB  - In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yorùbá (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation, and intensity, using different techniques and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model. We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data since it achieved a better accuracy for the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximation. We, therefore, conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications.
KW  - decision theory
KW  - fuzzy sets
KW  - least squares approximations
KW  - mathematical models
KW  - measurement errors
KW  - regression analysis
KW  - speech synthesis
UR  - http://www.scopus.com/inward/record.url?scp=33751252839&partnerID=8YFLogxK
U2  - 10.1016/j.csl.2006.06.005
DO  - 10.1016/j.csl.2006.06.005
M3  - Article
VL  - 21
SP  - 325
EP  - 349
JO  - Computer Speech and Language
JF  - Computer Speech and Language
SN  - 0885-2308
IS  - 2
ER  - 