Discrete Duration Model for Speech Synthesis
Abstract
The acoustic model and the duration model are the two major components of statistical parametric speech synthesis (SPSS) systems. Neural-network-based acoustic models make it possible to model phoneme duration at the phone level rather than at the state level, as is done in conventional hidden Markov model (HMM) based SPSS systems. Because the duration of a phoneme is a countable value, the distribution of phone-level duration given the linguistic features is discrete, which means the Gaussian assumption is no longer necessary. This paper investigates the performance of an LSTM-RNN duration model that directly models the probability of the countable duration values given the linguistic features, using cross-entropy as the training criterion. Multi-task learning is also explored, and the models are compared with the standard LSTM-RNN duration model using objective and subjective measures. The results show that directly modeling the discrete distribution is beneficial and that the multi-task model achieves better performance in phone-level duration modeling.
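The idea of treating duration as a discrete class and training with cross-entropy can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes durations are counted in frames and clipped to a hypothetical maximum `MAX_FRAMES`, and the `logits` list stands in for the output layer of an LSTM-RNN, which is not modeled here. All function names are illustrative.

```python
import math

MAX_FRAMES = 5  # hypothetical cap on duration classes (assumption, not from the paper)


def softmax(logits):
    """Turn raw scores into a discrete probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def duration_to_class(frames, max_frames=MAX_FRAMES):
    """Map a duration in frames (1..max_frames) to a 0-based class index, clipping outliers."""
    return min(max(frames, 1), max_frames) - 1


def cross_entropy(logits, target_class):
    """Negative log-probability assigned to the observed duration class."""
    return -math.log(softmax(logits)[target_class])


def predict_duration(logits):
    """Pick the most probable duration (in frames) from the discrete distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1  # convert the class index back to a frame count


# Stand-in for network outputs for one phone: one score per duration class.
logits = [0.1, 2.0, 0.5, -1.0, -2.0]
loss_if_2_frames = cross_entropy(logits, duration_to_class(2))
loss_if_4_frames = cross_entropy(logits, duration_to_class(4))
predicted = predict_duration(logits)
```

In this sketch the loss is lower when the observed duration matches the class the scores favor, and no Gaussian form is imposed on the predicted distribution. Taking the most probable class is only one possible decoding choice; the abstract does not say how the final duration is selected.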
Similar resources
Presentation of K Nearest Neighbor Gaussian Interpolation and comparing it with Fuzzy Interpolation in Speech Recognition
The Hidden Markov Model is a popular statistical method used in continuous and discrete speech recognition. The probability density function of the observation vectors in each state is estimated with discrete density or continuous density modeling. The performance (in correct word recognition rate) of the continuous density HMM is higher than that of the discrete density HMM, but its computation complexity is very ...
Multimodal analysis of speech and arm motion for prosody-driven synthesis of beat gestures
We propose a framework for joint analysis of speech prosody and arm motion towards automatic synthesis and realistic animation of beat gestures from speech prosody and rhythm. In the analysis stage, we first segment motion capture data and speech audio into gesture phrases and prosodic units via temporal clustering, and assign a class label to each resulting gesture phrase and prosodic unit. We...
State-Correlated Duration Model for HMM-Based Speech Synthesis System
This paper proposes a State-Correlated Duration model for an HMM-based speech synthesis system. It uses an improved forward-backward algorithm to estimate the state-duration transition probability between neighboring states. In the synthesis part, we determine the state duration taking the state-duration transition probability into account. Experimental results show that the speech we synthesized ...
Duration prediction using multi-level model for GPR-based speech synthesis
This paper introduces frame-based Gaussian process regression (GPR) into phone/syllable duration modeling for Thai speech synthesis. The GPR model is designed for predicting frame-level acoustic features using corresponding frame information, which includes relative position in each unit of utterance structure and linguistic information such as tone type and part of speech. Although the GPR-base...