Stochastic optimization procedures, such as stochastic gradient descent, have gained popularity for parameter estimation from large data sets. However, standard stochastic optimization procedures cannot effectively combine numerical stability with statistical and computational efficiency. Here, we introduce an implicit stochastic gradient descent procedure, the iterates of which are implicitly defined. Intuitively, implicit iterates shrink the standard iterates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed in practice, thus increasing stability without increasing the computational burden. When combined with averaging, the proposed procedure achieves statistical efficiency as well. We derive non-asymptotic bounds and characterize the asymptotic distribution of implicit procedures. Our analysis also reveals the asymptotic variance of a number of existing procedures. We demonstrate implicit stochastic gradient descent by further developing theory for generalized linear models, Cox proportional hazards, and M-estimation problems, and by carrying out extensive experiments. Our results suggest that the implicit stochastic gradient descent procedure is poised to become the workhorse of estimation with large data sets.


Implicit stochastic gradient descent

Links and resources

BibTeX key:
search on:

Comments and Reviews  

There is no review or comment yet. You can write one!


Cite this publication