Abstract
Several variants of the Long Short-Term Memory
(LSTM) architecture for recurrent neural networks
have been proposed since its inception in
1995. In recent years, these networks have become
the state-of-the-art models for a variety of
machine learning problems. This has led to a renewed
interest in understanding the role and utility
of various computational components of typical
LSTM variants. In this paper, we present
the first large-scale analysis of eight LSTM variants
on three representative tasks: speech recognition,
handwriting recognition, and polyphonic
music modeling. The hyperparameters of all
LSTM variants for each task were optimized separately
using random search, and their importance
was assessed using the powerful fANOVA
framework. In total, we summarize the results
of 5400 experimental runs (≈ 15 years of CPU
time), which makes our study the largest of its
kind on LSTM networks. Our results show
that none of the variants can improve upon the
standard LSTM architecture significantly, and
demonstrate the forget gate and the output activation
function to be its most critical components.
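For context, the forget gate and output activation function named above are components of the standard (vanilla) LSTM forward pass. In one common formulation with peephole connections, it can be sketched as follows (the notation for input weights $W$, recurrent weights $R$, peephole weights $p$, and biases $b$ is illustrative):

$$
\begin{aligned}
z_t &= g(W_z x_t + R_z y_{t-1} + b_z) && \text{block input}\\
i_t &= \sigma(W_i x_t + R_i y_{t-1} + p_i \odot c_{t-1} + b_i) && \text{input gate}\\
f_t &= \sigma(W_f x_t + R_f y_{t-1} + p_f \odot c_{t-1} + b_f) && \text{forget gate}\\
c_t &= i_t \odot z_t + f_t \odot c_{t-1} && \text{cell state}\\
o_t &= \sigma(W_o x_t + R_o y_{t-1} + p_o \odot c_t + b_o) && \text{output gate}\\
y_t &= o_t \odot h(c_t) && \text{block output}
\end{aligned}
$$

where $\sigma$ is the logistic sigmoid, $\odot$ denotes elementwise multiplication, and $g$, $h$ are typically $\tanh$; here $h$ is the output activation function referred to above.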
We further observe that the studied hyperparameters
are virtually independent and derive guidelines
for their efficient adjustment.
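As a rough illustration of the random-search procedure described above, hyperparameter configurations can be drawn independently for each trial. The parameter names and ranges below are hypothetical stand-ins, not the search spaces used in the study:

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration at random.

    Ranges are illustrative only; the actual search spaces
    are defined in the paper, not here.
    """
    return {
        # log-uniform over [1e-6, 1e-2]: sample the exponent uniformly
        "learning_rate": 10 ** rng.uniform(-6, -2),
        "hidden_size": rng.randint(32, 512),
        "momentum": rng.uniform(0.0, 0.99),
        "input_noise_std": rng.uniform(0.0, 1.0),
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(200)]
# Each configuration would be trained and evaluated independently.
# Because the draws are independent, the importance of each
# hyperparameter can later be assessed (e.g., with fANOVA) from the
# resulting (configuration, performance) pairs.
```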