Abstract
Deep learning has seen tremendous success over the past decade in computer
vision, machine translation, and gameplay. This success rests in crucial ways
on gradient-descent optimization and the ability to learn parameters of a
neural network by backpropagating observed errors. However, neural network
architectures are growing increasingly sophisticated and diverse, which
motivates an emerging quest for even more general forms of differentiable
programming, where arbitrary parameterized computations can be trained by
gradient descent. In this paper, we take a fresh look at automatic
differentiation (AD) techniques, and especially aim to demystify the
reverse-mode form of AD that generalizes backpropagation in neural networks.
We uncover a tight connection between reverse-mode AD and delimited
continuations, which permits implementing reverse-mode AD purely via operator
overloading and without any auxiliary data structures. We further show how this
formulation of AD can be fruitfully combined with multi-stage programming
(staging), leading to a highly efficient implementation that combines the
performance benefits of deep learning frameworks based on explicit reified
computation graphs (e.g., TensorFlow) with the expressiveness of pure library
approaches (e.g., PyTorch).
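
To make the central claim concrete, below is a minimal sketch (not the paper's actual code) of reverse-mode AD implemented purely via operator overloading in continuation-passing style. Explicit continuation arguments stand in for shift/reset so the example compiles with plain Scala and no compiler plugin; the names NumR and grad are illustrative assumptions. Each overloaded operator computes its forward value, runs the rest of the program as its continuation, and accumulates adjoints on the way back, so no explicit tape or graph data structure is needed.

object ReverseADSketch extends App {

  class NumR(val x: Double, var d: Double = 0.0) {
    // Compute the forward value, run the continuation k (the rest of the
    // program), then propagate the output adjoint back to the operands.
    def +(that: NumR)(k: NumR => Unit): Unit = {
      val y = new NumR(this.x + that.x)
      k(y)
      this.d += y.d
      that.d += y.d
    }
    def *(that: NumR)(k: NumR => Unit): Unit = {
      val y = new NumR(this.x * that.x)
      k(y)
      this.d += that.x * y.d
      that.d += this.x * y.d
    }
  }

  // Gradient of a scalar function written in CPS, evaluated at x0.
  def grad(f: NumR => (NumR => Unit) => Unit)(x0: Double): Double = {
    val x = new NumR(x0)
    f(x) { y => y.d = 1.0 }   // seed the adjoint of the output
    x.d
  }

  // d/dx (x*x + x) at x = 3.0 is 2*3 + 1 = 7
  val g = grad { x => k => x.*(x) { y => y.+(x)(k) } } (3.0)
  println(g)   // 7.0
}

In the paper's formulation, the shift operator captures the continuation k implicitly inside each operator and reset delimits the differentiated program, so user code reads like ordinary direct-style arithmetic; the explicit k above only makes that control flow visible in the sketch.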
Description
Demystifying Differentiable Programming: Shift/Reset the Penultimate
Backpropagator