Abstract
In this work we present a new agent architecture, called Reactor, which
combines multiple algorithmic and architectural contributions to produce an
agent with higher sample-efficiency than Prioritized Dueling DQN (Wang et al.,
2016) and Categorical DQN (Bellemare et al., 2017), while giving better
run-time performance than A3C (Mnih et al., 2016). Our first contribution is a
new policy evaluation algorithm called Distributional Retrace, which brings
multi-step off-policy updates to the distributional reinforcement learning
setting. The same approach can be used to convert several classes of multi-step
policy evaluation algorithms designed for expected value evaluation into
distributional ones. Next, we introduce the β-leave-one-out policy
gradient algorithm which improves the trade-off between variance and bias by
using action values as a baseline. Our final algorithmic contribution is a new
prioritized replay algorithm for sequences, which exploits the temporal
locality of neighboring observations for more efficient replay prioritization.
Using the Atari 2600 benchmarks, we show that each of these innovations
contributes to both the sample efficiency and final agent performance. Finally,
we demonstrate that Reactor reaches state-of-the-art performance after 200
million frames and less than a day of training.