Abstract
Policy gradient methods are among the most effective approaches for challenging
reinforcement learning problems with large state and/or action spaces. However,
little is known about even their most basic theoretical convergence properties,
including: if and how fast they converge to a globally optimal solution (say
with a sufficiently rich policy class); how they cope with approximation error
due to using a restricted class of parametric policies; or their finite sample
behavior. Such characterizations are important not only to compare these
methods to their approximate value function counterparts (where such issues are
relatively well understood, at least in the worst case), but also to help with
more principled approaches to algorithm design.
This work provides provable characterizations of computational,
approximation, and sample size issues with regards to policy gradient methods
in the context of discounted Markov Decision Processes (MDPs). We focus on
both: 1) "tabular" policy parameterizations, where the optimal policy is
contained in the class and where we show global convergence to the optimal
policy, and 2) restricted policy classes, which may not contain the optimal
policy and where we provide agnostic learning results. One insight of this work
is in formalizing how a favorable initial state distribution provides a means
to circumvent worst-case exploration issues. Overall, these results place
policy gradient methods on a solid theoretical footing, analogous to the global
convergence guarantees of iterative value function based algorithms.
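As an illustration of the tabular setting the abstract describes, the following is a minimal sketch (not the paper's exact experimental setup) of exact policy gradient ascent with a softmax tabular parameterization on a small randomly generated discounted MDP. The MDP sizes, step size, and iteration count are arbitrary choices for the example; the gradient formula used is the standard policy gradient theorem specialized to the softmax parameterization, where the partial derivative of the value with respect to theta[s, a] is d_rho(s) * pi(a|s) * A(s, a) / (1 - gamma).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9                      # illustrative sizes and discount
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = rng.uniform(size=(S, A))                 # rewards r(s, a)
rho = np.full(S, 1.0 / S)                    # initial state distribution

def value_and_grad(theta):
    # softmax policy pi(a|s) from tabular parameters theta[s, a]
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)    # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    # V^pi solves (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    # discounted state-visitation distribution d_rho under pi
    d = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)
    # exact softmax policy gradient: uses the advantage A(s,a) = Q(s,a) - V(s)
    grad = d[:, None] * pi * (Q - V[:, None]) / (1 - gamma)
    return rho @ V, grad

theta = np.zeros((S, A))
for _ in range(5000):                        # plain gradient ascent on V(rho)
    v, g = value_and_grad(theta)
    theta += 1.0 * g

# optimal value via value iteration, for comparison
V_star = np.zeros(S)
for _ in range(2000):
    V_star = (R + gamma * np.einsum('sat,t->sa', P, V_star)).max(axis=1)
print(f"gap to optimal value: {rho @ V_star - v:.6f}")
```

On this small MDP with a full-support initial distribution, the iterates approach the globally optimal value, matching the kind of global convergence guarantee the abstract refers to for the tabular parameterization.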