Abstract
Stochastic optimization procedures, such as stochastic gradient descent, have
gained popularity for parameter estimation from large data sets. However,
standard stochastic optimization procedures cannot effectively combine
numerical stability with statistical and computational efficiency. Here, we
introduce an implicit stochastic gradient descent procedure, the iterates of
which are implicitly defined. Intuitively, implicit iterates shrink the
standard iterates. The amount of shrinkage depends on the observed Fisher
information matrix, which does not need to be explicitly computed in practice,
thus increasing stability without increasing the computational burden. When
combined with averaging, the proposed procedure achieves statistical efficiency
as well. We derive non-asymptotic bounds and characterize the asymptotic
distribution of implicit procedures. Our analysis also reveals the asymptotic
variance of a number of existing procedures. We demonstrate implicit stochastic
gradient descent by further developing theory for generalized linear models,
Cox proportional hazards, and M-estimation problems, and by carrying out
extensive experiments. Our results suggest that the implicit stochastic
gradient descent procedure is poised to become the workhorse of estimation with
large data sets.
Users
Please
log in to take part in the discussion (add own reviews or comments).