DNN-Based Speech Synthesis: Importance of Input Features and Training Data
Authors
Abstract
Deep neural networks (DNNs) have recently been introduced in speech synthesis. This paper investigates the importance of input features and training data for speaker-dependent (SD) DNN-based speech synthesis. Various aspects of the DNN training procedure are examined. Additionally, several training sets of different sizes (i.e., 13.5, 3.6 and 1.5 h of speech) are evaluated.
Similar resources
An investigation of context clustering for statistical speech synthesis with deep neural network
The state-of-the-art DNN speech synthesis system directly maps linguistic input to acoustic output and voice quality improvement over the conventional MSD-GMM-HMM synthesis system has been reported. DNN-based speech synthesis system does not require context clustering as in GMM-HMM systems and this was believed to be the main advantage and contributor to performance improvement. Our previous wo...
A comparison of speech synthesis systems based on GPR, HMM, and DNN with a small amount of training data
In this paper, we evaluate a framework of statistical parametric speech synthesis based on Gaussian process regression (GPR) and compare it with those based on hidden Markov model (HMM) and deep neural network (DNN). Recently, for the purpose of improving the performance of HMM-based speech synthesis, novel frameworks using deep architectures have been proposed and have shown their effectivenes...
An Investigation of DNN-Based Speech Synthesis Using Speaker Codes
Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than the conventional HMM-based speech synthesis. However, an open problem remains as to whether the synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a simple met...
Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features
Speech enhancement is an important front-end technique to improve automatic speech recognition (ASR) in noisy environments. However, the wrong noise suppression of speech enhancement often causes additional distortions in speech signals, which degrades the ASR performance. To compensate the distortions, ASR needs to consider the uncertainty of enhanced features, which can be achieved by using t...
Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis
Feed-forward deep neural networks (DNNs) based text-to-speech (TTS) synthesis, which employs a multi-layered structure to exploit the statistical correlations between rich contextual information and the corresponding acoustic features, has been shown to outperform a decision tree based, GMM-HMM counterpart. However, the DNN-based TTS training has not taken the whole sequence, i.e., sentence, int...