Article,

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

A. Antos, {. Szepesvári, and R. Munos.
Machine Learning, 71 (1): 89--129 (April 2008)Published Online First: 14 Nov, 2007.
DOI: 10.1007/s10994-007-5038-2

Abstract

In this paper we consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algo- rithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.

BibTeX key: anszemu:mlj07
entry type: article
year: 2008
month: apr
journal: Machine Learning
number: 1
pages: 89--129
volume: 71
date-added: 2011-07-26 09:54:32 -0600
bdsk-url-1: http://www.springerlink.com/content/h46m1152288681jt/
url3: http://www.szit.bme.hu/~antos/ps/anszmu_sapi_mlj.ps.gz
pdf: papers/sapi_mlj.pdf
url2: http://springer.r.delivery.net/r/r?2.1.Ee.2Tp.1gRdFJ.BwJPwi..N.ErIo.2yDO.LBOEfA00
date-modified: 2011-07-26 10:03:59 -0600
DOI: 10.1007/s10994-007-5038-2
url: http://www.springerlink.com/content/h46m1152288681jt/
note: Published Online First: 14 Nov, 2007

BibSonomy

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Abstract

Tags

Users

Comments and Reviewsshow / hide

Cite this publication

More citation styles

search on