Effects of Training Data Variety in Generating Glottal Pulses from Acoustic Features with DNNs
Abstract
The glottal volume velocity waveform, the acoustic excitation of voiced speech, cannot be acquired through direct measurement during normal production of continuous speech. Glottal inverse filtering (GIF), however, can be used to estimate the glottal flow from recorded speech signals. Unfortunately, the usefulness of GIF algorithms is limited because they are sensitive to noise and require high-quality recordings. Recently, efforts have been made to expand the use of GIF by training deep neural networks (DNNs) to learn a statistical mapping between frame-level acoustic features and glottal pulses estimated by GIF. This framework has been successfully utilized in statistical speech synthesis in the form of the GlottDNN vocoder, which uses a DNN to generate glottal pulses to be used as the synthesizer’s excitation waveform. In this study, we investigate how the DNN-based generation of glottal pulses is affected by training data variety. The evaluation uses both objective measures and subjective listening tests of synthetic speech. The results suggest that the performance of glottal pulse generation with DNNs is affected particularly by how well the training corpus suits GIF: processing low-pitched male speech and sustained phonations yields better performance than processing high-pitched female voices or continuous speech.
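The mapping described in the abstract, from frame-level acoustic features to a glottal pulse, can be illustrated with a minimal sketch. This is not the GlottDNN architecture: the layer sizes, the single hidden layer, the learning rate, and the random placeholder data are all assumptions made only to show the general shape of such a regression.

```python
import numpy as np

# All sizes below are hypothetical, chosen only for illustration.
N_FEATURES = 48   # acoustic features per frame (assumed)
N_HIDDEN = 64     # hidden units (assumed)
PULSE_LEN = 400   # samples in one fixed-length pulse target (assumed)

rng = np.random.default_rng(0)

def init_params():
    """Small random weights for a one-hidden-layer regression network."""
    return {
        "W1": rng.normal(0.0, 0.1, (N_FEATURES, N_HIDDEN)),
        "b1": np.zeros(N_HIDDEN),
        "W2": rng.normal(0.0, 0.1, (N_HIDDEN, PULSE_LEN)),
        "b2": np.zeros(PULSE_LEN),
    }

def forward(params, features):
    """Return (hidden activations, predicted pulses) for a batch of frames."""
    hidden = np.tanh(features @ params["W1"] + params["b1"])
    return hidden, hidden @ params["W2"] + params["b2"]

def train_step(params, features, targets, lr=0.5):
    """One gradient-descent step on mean squared error.

    Returns the loss measured before the update.
    """
    hidden, predicted = forward(params, features)
    error = predicted - targets
    loss = float(np.mean(error ** 2))
    grad_out = 2.0 * error / error.size            # dLoss/dPredicted
    grad_hidden = (grad_out @ params["W2"].T) * (1.0 - hidden ** 2)
    params["W2"] -= lr * (hidden.T @ grad_out)
    params["b2"] -= lr * grad_out.sum(axis=0)
    params["W1"] -= lr * (features.T @ grad_hidden)
    params["b1"] -= lr * grad_hidden.sum(axis=0)
    return loss

# Placeholder data: random arrays standing in for real acoustic features
# and GIF-estimated pulses, which this sketch does not have.
features = rng.normal(size=(32, N_FEATURES))
targets = rng.normal(size=(32, PULSE_LEN))

params = init_params()
loss_before = train_step(params, features, targets)
for _ in range(20):
    train_step(params, features, targets)
_, pulses = forward(params, features)
loss_after = float(np.mean((pulses - targets) ** 2))
```

In a real system the targets would be pulses estimated by GIF from recorded speech, so the quality of those estimates bounds what the network can learn, which is the dependence on training data that the study examines.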
Related articles
Generative Adversarial Network-Based Glottal Waveform Model for Statistical Parametric Speech Synthesis
Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the glottal excitation waveform by using deep neural networks (DNNs). However, the squared error-b...
Language Adaptive DNNs for Improved Low Resource Speech Recognition
Deep Neural Network (DNN) acoustic models are commonly used in today’s state-of-the-art speech recognition systems. As neural networks are a data driven method, the amount of available training data directly impacts the performance. In the past, several studies have shown that multilingual training of DNNs leads to improvements, especially in resource constrained tasks in which only limited tra...
Reducing Mismatch in Training of DNN-Based Glottal Excitation Models in a Statistical Parametric Text-to-Speech System
Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates mismatch between training and synthesis, as the synthesized acoustic features used for the excitation model input diff...
HMM-based Finnish text-to-speech system utilizing glottal inverse filtering
This paper describes an HMM-based speech synthesis system that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed system, speech is first parametrized into spectral and excitation features using a glottal inverse filtering based method. The parameters are fed into an HMM system for training and then generated from the trained HMM according to te...
Analyzing the Effect of Channel Mismatch on the SRI Language Recognition Evaluation 2015 System
We present the work done by our group for the 2015 language recognition evaluation (LRE) organized by the National Institute of Standards and Technology (NIST). The focus of this evaluation was the development of language recognition systems for clusters of closely related languages using training data released by NIST. This training data contained a highly imbalanced sample from the languages ...