Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
نویسندگان
چکیده
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
منابع مشابه
Uncovering Latent Style Factors for Expressive Speech Synthesis
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of “style tokens” in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We ...
متن کاملTacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. G...
متن کاملEmotional End-to-End Neural Speech Synthesizer
In this paper, we introduce an emotional speech synthesizer based on the recent end-to-end neural model, named Tacotron. Despite its benefits, we found that the original Tacotron suffers from the exposure bias problem and irregularity of the attention alignment. Later, we address the problem by utilization of context vector and residual connection at recurrent neural networks (RNNs). Our experi...
متن کاملStyle Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-toend speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to con...
متن کاملEnd-to-End Neural Speech Synthesis
In recent years, end-to-end neural networks have become the state of the art for speech recognition tasks and they are now widely deployed in industry (Amodei et al., 2016). Naturally, this has led to the creation of systems to do the opposite – end-to-end speech synthesis from raw text. Very recently, neural TTS systems have become highly competitive with their conventional counterparts, showi...
متن کامل