Optimisation of phonetic aware speech recognition through multi-objective evolutionary algorithms

Jordan J. Bird; Elizabeth Wanner; Anikó Ekárt; Diego R. Faria

doi:10.1016/j.eswa.2020.113402

Corresponding author: Jordan J. Bird

Research output: Contribution to journal › Article › peer-review

Abstract

Recent advances in the availability of computational resources allow for more sophisticated approaches to speech recognition than ever before. This study considers Artificial Neural Network and Hidden Markov Model methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet rather than the classical approach of the classification of whole words and phrases, with a specific focus on both single and multi-objective evolutionary optimisation of bioinspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico, and the recordings are transformed into a static dataset of statistics by way of their Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window length of 200 ms, as well as a reshaped MFCC timeseries format for forecast-based models. A deep neural network with evolutionary optimised topology achieves 90.77% phoneme classification accuracy in comparison to the best HMM, which achieves 86.23% accuracy with 150 hidden units, when only accuracy is considered in a single-objective optimisation approach. The obtained solutions are far more complex than the HMM, taking around 248 seconds to train on powerful hardware versus 160 seconds for the HMM. A multi-objective approach is explored for this reason. In the multi-objective scalarisation approaches presented, in which real-time resource usage also contributes to solution fitness, far better solutions are produced that train far more quickly than the forecast approach (69 seconds) while retaining classification ability (86.73%). Weightings from 0.1 to 0.9 towards either maximising accuracy or reducing resource usage are suggested depending on the resources available, since many future IoT devices and autonomous robots may have limited access to cloud resources at a premium in comparison to the GPU used in this experiment.
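The weighted trade-off between accuracy and resource usage described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' fitness function: the weighted-sum form, the function name scalarised_fitness, and the 300-second normalisation cap are assumptions made purely for illustration. Only the accuracy and training-time figures are taken from the abstract.

```python
# Illustrative weighted-sum scalarisation of two objectives:
# classification accuracy (maximise) and training time (minimise).
# ASSUMPTION: the exact fitness formula used in the paper is not given in
# the abstract; this weighted sum and the 300 s cap are hypothetical.

def scalarised_fitness(accuracy, train_seconds, weight, max_seconds=300.0):
    """Return a fitness in [0, 1]; higher is better.

    accuracy      -- classification accuracy in percent (0 to 100)
    train_seconds -- training time in seconds
    weight        -- importance of accuracy; (1 - weight) goes to speed
    max_seconds   -- hypothetical cap used to normalise training time
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    accuracy_term = accuracy / 100.0
    # Clamp the time so the cost term stays within [0, 1].
    cost_term = min(train_seconds, max_seconds) / max_seconds
    return weight * accuracy_term + (1.0 - weight) * (1.0 - cost_term)


# (accuracy %, training seconds) as reported in the abstract.
candidates = {
    "single-objective DNN": (90.77, 248.0),
    "HMM, 150 hidden units": (86.23, 160.0),
    "multi-objective DNN": (86.73, 69.0),
}


def best_candidate(weight):
    """Name of the candidate with the highest scalarised fitness."""
    return max(
        candidates,
        key=lambda name: scalarised_fitness(*candidates[name], weight),
    )


for weight in (0.1, 0.5, 0.9):
    print(f"weight={weight}: {best_candidate(weight)}")
```

Under these assumed settings the 69-second solution ranks first at weights 0.1, 0.5, and 0.9, while at a weight of 1.0 (accuracy only) the 90.77% network ranks first. A different normalisation cap would shift where that crossover occurs.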

Original language	English
Article number	113402
Journal	Expert Systems with Applications
Volume	153
Early online date	24 Mar 2020
DOIs	https://doi.org/10.1016/j.eswa.2020.113402
Publication status	Published - 1 Sept 2020

Keywords

Applied hyperheuristics
Multi-objective evolutionary computation
Phoneme classification
Speech recognition

Access to Document

10.1016/j.eswa.2020.113402

Optimisation of phonetic aware speech recognition
© 2020, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/
Accepted author manuscript, 477 KB
Licence: CC BY-NC-ND 3.0

Cite this

@article{6e81971c23594a8ea552b15f35a13f5a,
  title = "Optimisation of phonetic aware speech recognition through multi-objective evolutionary algorithms",
  abstract = "Recent advances in the availability of computational resources allow for more sophisticated approaches to speech recognition than ever before. This study considers Artificial Neural Network and Hidden Markov Model methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet rather than the classical approach of the classification of whole words and phrases, with a specific focus on both single and multi-objective evolutionary optimisation of bioinspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico, and the recordings are transformed into a static dataset of statistics by way of their Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window length of 200 ms, as well as a reshaped MFCC timeseries format for forecast-based models. A deep neural network with evolutionary optimised topology achieves 90.77\% phoneme classification accuracy in comparison to the best HMM, which achieves 86.23\% accuracy with 150 hidden units, when only accuracy is considered in a single-objective optimisation approach. The obtained solutions are far more complex than the HMM, taking around 248 seconds to train on powerful hardware versus 160 seconds for the HMM. A multi-objective approach is explored for this reason. In the multi-objective scalarisation approaches presented, in which real-time resource usage also contributes to solution fitness, far better solutions are produced that train far more quickly than the forecast approach (69 seconds) while retaining classification ability (86.73\%). Weightings from 0.1 to 0.9 towards either maximising accuracy or reducing resource usage are suggested depending on the resources available, since many future IoT devices and autonomous robots may have limited access to cloud resources at a premium in comparison to the GPU used in this experiment.",
  keywords = "Applied hyperheuristics, Multi-objective evolutionary computation, Phoneme classification, Speech recognition",
  author = "Bird, Jordan J. and Wanner, Elizabeth and Ek{\'a}rt, Anik{\'o} and Faria, Diego R.",
  note = "{\textcopyright} 2020, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/",
  year = "2020",
  month = sep,
  day = "1",
  doi = "10.1016/j.eswa.2020.113402",
  language = "English",
  volume = "153",
  journal = "Expert Systems with Applications",
  issn = "0957-4174",
  publisher = "Elsevier",
}

TY  - JOUR
T1  - Optimisation of phonetic aware speech recognition through multi-objective evolutionary algorithms
AU  - Bird, Jordan J.
AU  - Wanner, Elizabeth
AU  - Ekárt, Anikó
AU  - Faria, Diego R.
PY  - 2020/9/1
Y1  - 2020/9/1
N2  - Recent advances in the availability of computational resources allow for more sophisticated approaches to speech recognition than ever before. This study considers Artificial Neural Network and Hidden Markov Model methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet rather than the classical approach of the classification of whole words and phrases, with a specific focus on both single and multi-objective evolutionary optimisation of bioinspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico, and the recordings are transformed into a static dataset of statistics by way of their Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window length of 200 ms, as well as a reshaped MFCC timeseries format for forecast-based models. A deep neural network with evolutionary optimised topology achieves 90.77% phoneme classification accuracy in comparison to the best HMM, which achieves 86.23% accuracy with 150 hidden units, when only accuracy is considered in a single-objective optimisation approach. The obtained solutions are far more complex than the HMM, taking around 248 seconds to train on powerful hardware versus 160 seconds for the HMM. A multi-objective approach is explored for this reason. In the multi-objective scalarisation approaches presented, in which real-time resource usage also contributes to solution fitness, far better solutions are produced that train far more quickly than the forecast approach (69 seconds) while retaining classification ability (86.73%). Weightings from 0.1 to 0.9 towards either maximising accuracy or reducing resource usage are suggested depending on the resources available, since many future IoT devices and autonomous robots may have limited access to cloud resources at a premium in comparison to the GPU used in this experiment.
KW  - Applied hyperheuristics
KW  - Multi-objective evolutionary computation
KW  - Phoneme classification
KW  - Speech recognition
UR  - http://www.scopus.com/inward/record.url?scp=85083000399&partnerID=8YFLogxK
UR  - https://www.sciencedirect.com/science/article/abs/pii/S0957417420302268?via%3Dihub
U2  - 10.1016/j.eswa.2020.113402
DO  - 10.1016/j.eswa.2020.113402
M3  - Article
AN  - SCOPUS:85083000399
SN  - 0957-4174
VL  - 153
JO  - Expert Systems with Applications
JF  - Expert Systems with Applications
M1  - 113402
ER  - 
