Zusammenfassung
We introduce a new, high-throughput, synchronous, distributed, data-parallel,
stochastic-gradient-descent learning algorithm. This algorithm uses amortized
inference in a compute-cluster-specific, deep, generative, dynamical model to
perform joint posterior predictive inference of the mini-batch gradient
computation times of all worker-nodes in a parallel computing cluster. We show
that a synchronous parameter server can, by utilizing such a model, choose an
optimal cutoff time beyond which mini-batch gradient messages from slow workers
are ignored that maximizes overall mini-batch gradient computations per second.
In keeping with earlier findings we observe that, under realistic conditions,
eagerly discarding the mini-batch gradient computations of stragglers not only
increases throughput but actually increases the overall rate of convergence as
a function of wall-clock time by virtue of eliminating idleness. The principal
novel contribution and finding of this work goes beyond this by demonstrating
that using the predicted run-times from a generative model of cluster worker
performance to dynamically adjust the cutoff improves substantially over the
static-cutoff prior art, leading to, among other things, significantly reduced
deep neural net training times on large computer clusters.
Beschreibung
High Throughput Synchronous Distributed Stochastic Gradient Descent
Links und Ressourcen
Tags
Community