Comparison of Modeling Target in LSTM-RNN Duration Model

نویسندگان

  • Bo Chen
  • Jiahao Lai
  • Kai Yu
چکیده

Speech duration is an important component in statistical parameter speech synthesis(SPSS). In LSTM-RNN based SPSS system, the speech duration affects the quality of synthesized speech in two aspects, the prosody of speech and the position features in acoustic model. This paper investigated the effects of duration in LSTM-RNN based SPSS system. The performance of the acoustic models with position features at different levels are compared. Also, duration models with different network architectures are presented. A method to utilize the priori knowledge that the sum of state duration of a phoneme should be equal to the phone duration is proposed and proved to have better performance in both state duration and phone duration modeling. The result shows that acoustic model with state-level position features has better performance in acoustic modeling (especially in voice/unvoice classification), which means statelevel duration model still has its advantage and the duration models with the priori knowledge can result in better speech quality.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Discrete Duration Model for Speech Synthesis

The acoustic model and the duration model are the two major components in statistical parametric speech synthesis (SPSS) systems. The neural network based acoustic model makes it possible to model phoneme duration at phone-level instead of state-level in conventional hidden Markov model (HMM) based SPSS systems. Since the duration of phonemes is countable value, the distribution of the phone-le...

متن کامل

What to Do Next: Modeling User Behaviors by Time-LSTM

Recently, Recurrent Neural Network (RNN) solutions for recommender systems (RS) are becoming increasingly popular. The insight is that, there exist some intrinsic patterns in the sequence of users’ actions, and RNN has been proved to perform excellently when modeling sequential data. In traditional tasks such as language modeling, RNN solutions usually only consider the sequential order of obje...

متن کامل

Semi-Supervised Training in Deep Learning Acoustic Model

We studied the semi-supervised training in a fully connected deep neural network (DNN), unfolded recurrent neural network (RNN), and long short-term memory recurrent neural network (LSTM-RNN) with respect to the transcription quality, the importance data sampling, and the training data amount. We found that DNN, unfolded RNN, and LSTM-RNN are increasingly more sensitive to labeling errors. For ...

متن کامل

Advanced LSTM: A Study about Better Time Dependency Modeling in Emotion Recognition

Long short-term memory (LSTM) is normally used in recurrent neural network (RNN) as basic recurrent unit. However, conventional LSTM assumes that the state at current time step depends on previous time step. This assumption constraints the time dependency modeling capability. In this study, we propose a new variation of LSTM, advanced LSTM (A-LSTM), for better temporal context modeling. We empl...

متن کامل

Investigation of Senone-based Long-Short Term Memory RNNs for Spoken Language Recognition

Recently, the integration of deep neural networks (DNNs) trained to predict senone posteriors with conventional language modeling methods has been proved effective for spoken language recognition. This work extends some of the senone-based DNN frameworks by replacing the DNN with the LSTM RNN. Two of these approaches use the LSTM RNN to generate features. The features are extracted from the rec...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017