Abstract
With the rapid development of deep learning, it has become common to train large
neural networks on massive amounts of data. Asynchronous Stochastic Gradient
Descent (ASGD) is widely adopted for this task because of its efficiency, but it
is known to suffer from the problem of delayed gradients: by the time a local
worker adds its gradient to the global model, the global model may already have
been updated by other workers, so the gradient becomes "delayed". We propose a
novel technique to compensate for this delay, making the optimization behavior
of ASGD closer to that of sequential SGD. This is achieved by leveraging a
Taylor expansion of the gradient function together with an efficient
approximation of the Hessian matrix of the loss function. We call the new
algorithm Delay Compensated ASGD (DC-ASGD). We evaluate the proposed algorithm
on the CIFAR-10 and ImageNet datasets, and the experimental results demonstrate
that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly
approaches the performance of sequential SGD.
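The abstract does not spell out the compensation formula. Below is a minimal
Python/NumPy sketch of one plausible instantiation of the idea, assuming a
first-order Taylor correction of the delayed gradient in which the Hessian is
approximated by the diagonal of the gradient outer product. The function name
dc_asgd_update, the hyperparameter lambda_dc, and all numeric values are
illustrative assumptions, not necessarily the paper's exact algorithm.

    import numpy as np

    def dc_asgd_update(w_global, w_backup, grad, lr=0.1, lambda_dc=0.04):
        """One delay-compensated ASGD step (illustrative sketch).

        w_global : current global parameters, possibly updated by other
                   workers since `grad` was computed
        w_backup : snapshot of the parameters the worker used for `grad`
        grad     : the (delayed) stochastic gradient computed at w_backup
        lambda_dc: strength of the compensation term (assumed hyperparameter)
        """
        # First-order Taylor correction of the delayed gradient, with the
        # Hessian approximated by the diagonal of the gradient outer product:
        #   g(w_global) ~= g(w_backup) + lambda * g * g * (w_global - w_backup)
        compensated = grad + lambda_dc * grad * grad * (w_global - w_backup)
        return w_global - lr * compensated

    # Toy usage: a worker computed `grad` on a stale snapshot w_backup,
    # while other workers moved the global model on to w_global.
    rng = np.random.default_rng(0)
    w_backup = rng.normal(size=4)
    w_global = w_backup + 0.01 * rng.normal(size=4)  # drift from other workers
    grad = rng.normal(size=4)                        # delayed gradient
    w_new = dc_asgd_update(w_global, w_backup, grad)

Without the compensation term (lambda_dc = 0), this reduces to plain ASGD,
which simply applies the stale gradient to the updated global model.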