Style-Specific Phrasing in Speech Synthesis

Authors

  • Alok Parlikar
  • Alan W Black
  • Florian Metze
  • Ian Lane
  • Kishore Prahallad
Abstract

People pause between words and sentences when they speak. They pause to emphasize content, to make an utterance more understandable, or just to take a breath. A speech synthesizer should insert similar pauses to sound natural. The process of inserting prosodic breaks in an utterance is called phrasing. Phrasing is a crucial step during speech synthesis because other models of prosody depend on it. Phrasing also helps characterize styles of speech, and synthesizers must adapt their phrasing to different speaking styles. This thesis presents a data-driven, grammar-based approach that can be used to build style-specific phrasing models. We automatically label phrase breaks from speech data and use features over acoustic syntax in our modeling. Experimental results, both objective and subjective, show that these models are better than the prior state of the art across various speaking styles. This thesis also presents a minimum error-rate training approach that improves the phrasing models by optimizing them directly towards the evaluation criterion: the F-measure. This framework also allows us to define a knob that can be used to vary the number of phrase breaks produced in an utterance, which can be useful when changing the speaking rate. The thesis further discusses modeling not just the placement of phrase breaks, but also their duration. Corpus analysis shows that the durations of breaks vary considerably between styles, and we present methods that capture this variation in a way that is perceptually better. The presented phrasing methods can have a broader impact on intonation models and can enhance the intelligibility of synthesized machine translation output. These methods can also be extended to “low-resource” scenarios, such as building voices for uncommon languages or for languages that do not have a standardized orthography.
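To make the evaluation criterion concrete, the sketch below scores a predicted break sequence against a reference using the F-measure on the break class. It is an illustration only, not the thesis's minimum error-rate training procedure. The `B`/`NB` labels follow the break/non-break convention common in phrase break prediction; the function names, the per-word break probabilities, and the threshold used as a stand-in for the "knob" are all hypothetical.

```python
from typing import List, Sequence


def f_measure(reference: Sequence[str], predicted: Sequence[str],
              beta: float = 1.0) -> float:
    """F-measure on the break class 'B' for aligned B/NB label sequences."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must align word for word")
    pairs = list(zip(reference, predicted))
    tp = sum(r == "B" and p == "B" for r, p in pairs)    # correct breaks
    fp = sum(r == "NB" and p == "B" for r, p in pairs)   # spurious breaks
    fn = sum(r == "B" and p == "NB" for r, p in pairs)   # missed breaks
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def label_breaks(break_probs: Sequence[float], threshold: float) -> List[str]:
    """Assumed 'knob': a lower threshold yields more predicted breaks."""
    return ["B" if p >= threshold else "NB" for p in break_probs]


# Invented example: one label per word boundary in a six-word utterance.
reference = ["NB", "NB", "B", "NB", "NB", "B"]
break_probs = [0.10, 0.45, 0.80, 0.20, 0.55, 0.90]  # hypothetical model scores

for threshold in (0.3, 0.5, 0.7):
    predicted = label_breaks(break_probs, threshold)
    score = f_measure(reference, predicted)
    print(f"threshold={threshold} breaks={predicted.count('B')} F={score:.3f}")
```

Lowering the threshold inserts more breaks, which is one simple way a single parameter could trade precision against recall. How the thesis actually defines its knob is not stated in the abstract.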


Similar resources

A Grammar Based Approach to Style Specific Phrase Prediction

We present an approach to style-specific phrasing for Text-to-Speech (TTS) systems. We formulate the problem of phrase break prediction (or phrasing) as the generation of a sequence of breaks (B) and non-breaks (NB) after each word in a sentence. We use prosodic breaks in speech data to build shallow parses over the corresponding text. We then learn a grammar that can predict these shallow prosodic pars...


Automatic Building of Synthetic Voices from Audio Books

Current state-of-the-art text-to-speech systems produce intelligible speech but lack the prosody of natural utterances. Building better models of prosody involves development of prosodically rich speech databases. However, development of such speech databases requires a large amount of effort and time. An alternative is to exploit story style monologues (long speech files) in audio books. These...


Several Aspects of Machine-Driven Phrasing in Text-to-Speech Systems

The article discusses differences between a priori and a posteriori phrasing and their importance in the task of automatic prosodic phrasing in text-to-speech systems. Using several examples, it illustrates shortcomings of the common practice of evaluating a priori phrasing performance against the a posteriori phrasing of reference corpus data. The paper also proposes and evaluates a method for a priori phrasing ba...


Linguistic Processor Training on Speaker Data for Unit Selection Text-to-Speech

This paper describes an approach to synthesizing personalized speech that preserves not only the speaker's voice but also the speaker's pronunciation peculiarities. Personalization is realized by means of pronunciation models trained on speaker data contained in the speaker's speech database. Untrained models allow speech to be synthesized in a neutral, normative style. On the segmental level, the transcription mode...


Minimum error rate training for phrasing in speech synthesis

Phrase break prediction models in speech synthesis are classifiers that predict whether or not each word boundary is a prosodic break. These classifiers are generally trained to optimize the likelihood of prediction, and their performance is evaluated in terms of classification accuracy. We propose a minimum error rate training method for phrase break prediction. We combine multiple phrasing mo...




Publication date: 2013