Abstract
In this paper we propose a novel model for unconditional audio generation
based on generating one audio sample at a time. We show that our model, which
profits from combining memory-less modules, namely autoregressive multilayer
perceptrons, and stateful recurrent neural networks in a hierarchical structure,
is able to capture underlying sources of variations in the temporal sequences
over very long time spans, on three datasets of different nature. Human
evaluation of the generated samples indicates that our model is preferred over
competing models. We also show how each component of the model contributes to
the exhibited performance.