Abstract
We propose a Text-to-Speech method to create an unseen expressive style using
one utterance of expressive speech of around one second. Specifically, we
enhance the disentanglement capabilities of a state-of-the-art
sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a
Householder Flow. The proposed system achieves a 22% reduction in KL divergence
while jointly improving perceptual metrics over the state of the art. At synthesis
time we use one example of expressive style as a reference input to the encoder
for generating any text in the desired style. Perceptual MUSHRA evaluations
show that we can create a voice with a 9% relative naturalness improvement over
standard Neural Text-to-Speech, while also improving the perceived emotional
intensity (a score of 59, compared to 55 for neutral speech).
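The Householder Flow mentioned above composes a sequence of Householder reflections, each of the form H = I - 2vv^T/||v||^2. Because every reflection is orthogonal, the log-determinant of the flow's Jacobian is zero, which makes it a cheap way to enrich the VAE posterior. The sketch below illustrates only this generic transform, not the paper's system: the function name is my own, and in the actual model the reflection vectors would be predicted by the encoder rather than supplied directly.

```python
import numpy as np

def householder_flow(z0, vs):
    """Apply a Householder flow: z_K = H_K ... H_1 z_0.

    Each H_k = I - 2 v_k v_k^T / ||v_k||^2 is an orthogonal
    reflection, so the log-det-Jacobian of the full flow is zero.
    `vs` stands in for the reflection vectors that, in the paper's
    setting, an encoder network would predict (an assumption here).
    """
    z = np.asarray(z0, dtype=float)
    for v in vs:
        v = v / np.linalg.norm(v)      # unit reflection vector
        z = z - 2.0 * v * (v @ z)      # (I - 2 v v^T) z, matrix-free
    return z
```

Since each reflection is orthogonal, the transformed sample keeps the norm of the input, which is an easy sanity check on the implementation.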