Abstract
Momentum-based stochastic gradient methods such as heavy ball (HB) and
Nesterov's accelerated gradient descent (NAG) method are widely used in
practice for training deep networks and other supervised learning models, as
they often provide significant improvements over stochastic gradient descent
(SGD). Rigorously speaking, "fast gradient" methods have provable improvements
over gradient descent only for the deterministic case, where the gradients are
exact. In the stochastic case, the popular explanation for their wide
applicability is that these fast gradient methods, when applied with stochastic
gradients, partially mimic their exact-gradient counterparts, resulting in some
practical gain. This work provides a counterpoint to this belief by proving
that there exist simple problem instances where these methods cannot outperform
SGD despite the best setting of their parameters. These
negative problem instances are, in an informal sense, generic; they do not look
like carefully constructed pathological instances. These results suggest (along
with empirical evidence) that HB or NAG's practical performance gains are a
by-product of mini-batching.
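For reference, the classical update rules being compared can be written as a short sketch. The following is a minimal NumPy illustration of the textbook SGD, HB, and NAG updates; the step size and momentum values are arbitrary placeholders, not settings taken from the paper.

import numpy as np

def sgd_step(w, grad, lr=0.1):
    # Plain SGD: w_{t+1} = w_t - lr * g_t, with g_t a (stochastic) gradient at w_t.
    return w - lr * grad(w)

def hb_step(w, v, grad, lr=0.1, momentum=0.9):
    # Heavy ball (HB): the velocity accumulates past gradients evaluated at w_t.
    v = momentum * v - lr * grad(w)
    return w + v, v

def nag_step(w, v, grad, lr=0.1, momentum=0.9):
    # Nesterov's accelerated gradient (NAG): the gradient is evaluated at the
    # look-ahead point w_t + momentum * v_t instead of at w_t.
    v = momentum * v - lr * grad(w + momentum * v)
    return w + v, v

# Toy usage on the quadratic f(w) = 0.5 * ||w||^2, whose gradient is w.
grad = lambda w: w
w, v = np.ones(2), np.zeros(2)
for _ in range(100):
    w, v = nag_step(w, v, grad)
print(w)  # close to the minimizer at the origin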
Furthermore, this work provides a viable (and provable) alternative, which,
on the same set of problem instances, significantly improves over HB, NAG, and
SGD's performance. This algorithm, referred to as Accelerated Stochastic
Gradient Descent (ASGD), is a simple-to-implement stochastic algorithm based
on a relatively less popular variant of Nesterov's acceleration. Extensive
empirical results in this paper show that ASGD has performance gains over HB,
NAG, and SGD.
Description
On the insufficiency of existing momentum schemes for Stochastic
Optimization