Phone-Level Prosody Modelling With GMM-Based MDN for Diverse and Controllable Speech Synthesis
Abstract
Generating natural speech with a diverse and smooth prosody pattern is a challenging task. Although random sampling from a phone-level prosody distribution has been investigated to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what can be achieved by humans. This is largely due to the use of a uni-modal distribution, such as a single Gaussian, in prior works on phone-level prosody modelling. In this work, we propose a novel approach that models phone-level prosodies with a GMM-based mixture density network (MDN), and then extend it for multi-speaker TTS using speaker adaptation transforms of the Gaussian means and variances. Furthermore, we show that we can clone the prosodies of a reference speech by sampling from the Gaussian components that produce the reference prosodies. Our experiments on the LJSpeech and LibriTTS datasets show that the proposed method with GMM-based MDN not only achieves significantly better diversity than a single Gaussian in both single-speaker and multi-speaker TTS, but also provides better naturalness. The prosody cloning experiments demonstrate that the prosody similarity of the proposed method is comparable to that of a recent fine-grained VAE, while the target speaker similarity is better.
Similar resources
Prosody Modelling for Syllable-based Speech Synthesis
Prosody model used in the syllable based speech synthesizer DEMOSTHENES is described in the paper. The paper focuses on the segmental structure, especially on the segmentation into rhythm units (prosodic phrases). Relations between prosodic segments and sentence constituents are also discussed.
Prosody annotation for corpus based speech synthesis
The paper concerns prosody annotation especially for application in a corpus based speech synthesis. In order to establish the rules of automatic intonation modelling, phonetically labeled speech database of 4 hours has been perceptually and acoustically analyzed. The speech material included different text types and prosodically rich phrases. The annotation of the speech database consists in p...
Prosody Aware Word-Level Encoder Based on BLSTM-RNNs for DNN-Based Speech Synthesis
Recent studies have shown the effectiveness of word vectors in DNN-based speech synthesis. However, these word vectors, trained from a large amount of text, generally carry semantic information rather than prosodic information, which is important for speech synthesis. Therefore, if word vectors that take prosodic information into account can be obtained, it would be expected t...
Prosody control in HMM-based speech synthesis
In HMM-based speech synthesis, trained statistical models (context-dependent HMMs) are used to predict duration and generate parameters like mel-cepstral coefficients, log F0 values, and bandpass voicing strengths using the maximum likelihood parameter generation algorithm including global variance (Toda et al, 2007). In the later stages, F0 parameters, bandpass voicing strengths, and the five ...
Prosody modelling in Czech text-to-speech synthesis
This paper describes data-driven modelling of all three basic prosodic features – fundamental frequency, intensity and segmental duration – in the Czech text-to-speech system ARTIC. The fundamental frequency is generated by a model based on concatenation of automatically acquired intonational patterns. Intensity of synthesised speech is modelled by experimentally created rules which are in conf...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2022
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2021.3133205