Abstract
Policy gradient methods are among the most effective approaches for challenging
reinforcement learning problems with large state and/or action spaces. However,
little is known about even their most basic theoretical convergence properties,
including: if and how fast they converge to a globally optimal solution (say
with a sufficiently rich policy class); how they cope with approximation error
due to using a restricted class of parametric policies; or their finite sample
behavior. Such characterizations are important not only to compare these
methods to their approximate value function counterparts (where such issues are
relatively well understood, at least in the worst case), but also to help with
more principled approaches to algorithm design.
This work provides provable characterizations of computational,
approximation, and sample size issues with regards to policy gradient methods
in the context of discounted Markov Decision Processes (MDPs). We focus on
both: 1) "tabular" policy parameterizations, where the optimal policy is
contained in the class and where we show global convergence to the optimal
policy, and 2) restricted policy classes, which may not contain the optimal
policy and where we provide agnostic learning results. One insight of this work
is in formalizing how a favorable initial state distribution provides a means
to circumvent worst-case exploration issues. Overall, these results place
policy gradient methods on a solid theoretical footing, analogous to the global
convergence guarantees of iterative value function based algorithms.
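As an illustration of the tabular setting the abstract describes, the following is a minimal sketch (not the paper's exact experimental setup) of exact policy gradient ascent with a softmax tabular parameterization on a small randomly generated discounted MDP. The MDP sizes, step size, and iteration count are arbitrary choices for the example; the gradient formula used is the standard policy gradient theorem specialized to the softmax parameterization, where the partial derivative of the value with respect to theta[s, a] is d_rho(s) * pi(a|s) * A(s, a) / (1 - gamma).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9                      # illustrative sizes and discount
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = rng.uniform(size=(S, A))                 # rewards r(s, a)
rho = np.full(S, 1.0 / S)                    # initial state distribution

def value_and_grad(theta):
    # softmax policy pi(a|s) from tabular parameters theta[s, a]
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)    # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    # V^pi solves (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    # discounted state-visitation distribution d_rho under pi
    d = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)
    # exact softmax policy gradient: uses the advantage A(s,a) = Q(s,a) - V(s)
    grad = d[:, None] * pi * (Q - V[:, None]) / (1 - gamma)
    return rho @ V, grad

theta = np.zeros((S, A))
for _ in range(5000):                        # plain gradient ascent on V(rho)
    v, g = value_and_grad(theta)
    theta += 1.0 * g

# optimal value via value iteration, for comparison
V_star = np.zeros(S)
for _ in range(2000):
    V_star = (R + gamma * np.einsum('sat,t->sa', P, V_star)).max(axis=1)
print(f"gap to optimal value: {rho @ V_star - v:.6f}")
```

On this small MDP with a full-support initial distribution, the iterates approach the globally optimal value, matching the kind of global convergence guarantee the abstract refers to for the tabular parameterization.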