Abstract
Reinforcement learning (RL) typically defines a discount factor as part of
the Markov Decision Process. The discount factor values future rewards by an
exponential scheme that leads to theoretical convergence guarantees for the
Bellman equation. However, evidence from psychology, economics and neuroscience
suggests that humans and animals instead have hyperbolic time-preferences. In
this work we revisit the fundamentals of discounting in RL and bridge this
disconnect by implementing an RL agent that acts via hyperbolic discounting. We
demonstrate that a simple approach approximates hyperbolic discount functions
while still using familiar temporal-difference learning techniques in RL.
Additionally, and independent of hyperbolic discounting, we make the surprising
discovery that simultaneously learning value functions over multiple
time-horizons is an effective auxiliary task which often improves over a strong
value-based RL agent, Rainbow.
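
The abstract does not spell out how the hyperbolic discount function is approximated. As an illustrative sketch only, one route that fits the description rests on the identity \int_0^1 \gamma^{kt} d\gamma = 1 / (1 + kt): averaging exponential discounts over many values of \gamma recovers a hyperbolic discount, and each exponential term is something a standard temporal-difference learner can handle. The function names, the coefficient k and the midpoint grid below are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def hyperbolic_discount(t, k=0.1):
    """Exact hyperbolic discount 1 / (1 + k * t) for delay t."""
    return 1.0 / (1.0 + k * t)


def approx_hyperbolic_discount(t, k=0.1, n_gammas=1000):
    """Approximate the hyperbolic discount by averaging exponential discounts.

    Uses a midpoint Riemann sum of gamma ** (k * t) over gamma in (0, 1).
    The grid size n_gammas is an illustrative choice (assumption).
    """
    # Midpoints of n_gammas equal-width bins covering the interval (0, 1).
    gammas = (np.arange(n_gammas) + 0.5) / n_gammas
    # Mean over equal-width bins equals the Riemann sum of the integral.
    return float(np.mean(gammas ** (k * t)))


if __name__ == "__main__":
    for t in (0, 5, 10, 50):
        exact = hyperbolic_discount(t)
        approx = approx_hyperbolic_discount(t)
        print(f"t={t:3d}  exact={exact:.4f}  approx={approx:.4f}")
```

In an agent, each \gamma on the grid would correspond to its own value estimate learned with an ordinary exponential-discount update, which is one way to read the abstract's remark about learning value functions over multiple time-horizons.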