Modelling the temporal structure of newsreaders' speech on neural networks for Estonian text-to-speech synthesis
Abstract
Generating natural-sounding synthetic speech from text requires precise control over the temporal structure of the speech flow. The present paper describes an attempt to replace the rule-based durational model hitherto used in Estonian text-to-speech synthesis with neural networks (NN). To this end, fluent speech of radio announcers and newsreaders was analysed and its temporal structure was modelled with neural networks. Analysis of pauses in extended material revealed that, if a text is read out at a normal speech rate, the pauses can be classified so that the results can be used in speech synthesis. For sound durations, certain characteristics of phone context as well as certain syllable-level features were found to be the relevant input for an NN algorithm. For models of pause durations and positions, however, the prevalent features were variables characterizing text structure (punctuation marks and conjunctions).
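As a rough illustration of what "variables characterizing text structure" might look like as network input, the sketch below derives two features for each word boundary: the punctuation mark that follows the word, and whether the next word is a conjunction. The function name `boundary_features`, the feature names, and the short conjunction list are assumptions made for this example; the abstract does not specify the actual feature set or encoding.

```python
import re

# Assumed, deliberately incomplete list of Estonian conjunctions (illustration only).
CONJUNCTIONS = {"ja", "ning", "või", "aga", "kuid", "et"}
PUNCTUATION = set(".,;:!?")


def boundary_features(text):
    """Return one feature dict per word boundary (between word i and word i+1)."""
    tokens = re.findall(r"\w+|[.,;:!?]", text.lower())

    # Attach each punctuation mark to the word that precedes it.
    words = []  # items are [word, trailing_punctuation]
    for tok in tokens:
        if tok in PUNCTUATION:
            if words:
                words[-1][1] = tok
        else:
            words.append([tok, ""])

    features = []
    for i in range(len(words) - 1):
        word, punct = words[i]
        next_word = words[i + 1][0]
        features.append({
            "word": word,
            "punct_after": punct,  # "" when no mark follows the word
            "next_is_conjunction": next_word in CONJUNCTIONS,
        })
    return features


if __name__ == "__main__":
    for row in boundary_features("Ta tuli, aga lahkus. Ja kõik."):
        print(row)
```

In a real model these categorical values would still need to be encoded numerically (for instance one-hot) before being passed to a network that predicts whether a pause occurs at the boundary and how long it lasts.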
Similar resources
Modelling Speech Temporal Structure for Estonian Text-to-speech Synthesis: Feature Selection
The article discusses the principles of selecting features for modelling the temporal structure of Estonian speech, using different types of read-out texts, with a view to text-to-speech synthesis (TTS). Feature selection is known to depend on certain general issues regulating speech temporal structure, as well as on some language-specific aspects. The durational model of Estonian stands out for...
Convolutional Neural Network with Adaptive Windows for Speech Recognition
Although speech recognition systems are widely used and their accuracy is continually increasing, there is a considerable gap between their performance and human recognition ability. This is partially due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
Introducing Deep Modular Neural Networks with a Dual Spatio-Temporal Structure for Improving Continuous Persian Speech Recognition
In this article, growable deep modular neural networks for continuous speech recognition are introduced. These networks can be grown to implement the spatio-temporal information of the frame sequences at their input layer as well as their labels at the output layer at the same time. The trained neural network with such double spatio-temporal association structure can learn the phonetic sequence...
Speech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
Influences of Contextual Predictability and Lexical Prosody on Estonian Word Duration
The article investigates how different factors such as word predictability and part of speech may affect word duration in Estonian speech. The material comes from corpora of read texts. Using the five most frequent words in the material as examples (eesti 'Estonian', ei 'not', ja 'and', on 'is; are', see 'it; this'), the correlation between the predictability and duration of words is studied. It is c...