Article in conference proceedings

Value-iteration Based Fitted Policy Iteration: Learning with a Single Trajectory

A. Antos, Cs. Szepesvári, and R. Munos.
2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pages 330--337. IEEE, 2007. (Honolulu, Hawaii, Apr 1--5, 2007.)

Abstract

We consider batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems when the training data consists of the trajectory of some fixed behaviour policy. The algorithm studied is policy iteration in which, in successive iterations, the action-value functions of the intermediate policies are obtained by means of approximate value iteration. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance. The bounds depend on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used. One of the main novelties of the paper is that new smoothness constraints are introduced, thereby significantly extending the scope of previous results.
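
The loop described in the abstract (policy iteration whose evaluation step is approximate value iteration, fed by one trajectory of a fixed behaviour policy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional toy MDP, the uniform random behaviour policy, the quadratic features, the least-squares regression step, the discount factor, and all function names.

```python
import numpy as np

GAMMA = 0.5      # discount factor (illustrative choice, not from the paper)
N_ACTIONS = 2    # action 0 moves left, action 1 moves right (toy MDP)

def features(x):
    """Quadratic features [1, x, x^2]; a hypothetical function set."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

def q_values(weights, x):
    """Q(x, a) for all actions; weights has shape (N_ACTIONS, 3)."""
    return features(x) @ weights.T

def greedy(weights, x):
    """Greedy action with respect to the current Q estimate."""
    return np.argmax(q_values(weights, x), axis=1)

def collect_trajectory(n_steps, rng):
    """One trajectory of a uniform random behaviour policy on [0, 1]."""
    xs = np.empty(n_steps + 1)
    acts = np.empty(n_steps, dtype=int)
    rews = np.empty(n_steps)
    xs[0] = 0.5
    for t in range(n_steps):
        a = rng.integers(N_ACTIONS)
        step = 0.1 if a == 1 else -0.1
        xs[t + 1] = np.clip(xs[t] + step + 0.02 * rng.standard_normal(), 0.0, 1.0)
        acts[t] = a
        rews[t] = xs[t + 1]          # reward is the next state (toy choice)
    return xs[:-1], acts, rews, xs[1:]

def fitted_policy_iteration(x, a, r, x_next, n_policy_iters=5, n_value_iters=20):
    """Policy iteration; each policy is evaluated by fitted value iteration."""
    weights = np.zeros((N_ACTIONS, 3))
    phi = features(x)
    idx = np.arange(len(x_next))
    for _ in range(n_policy_iters):
        # Policy improvement: the greedy policy is held fixed during evaluation.
        next_actions = greedy(weights, x_next)
        w = np.zeros_like(weights)
        for _ in range(n_value_iters):
            # Regression targets r + gamma * Q_m(x', pi(x')).
            targets = r + GAMMA * q_values(w, x_next)[idx, next_actions]
            new_w = np.empty_like(w)
            for act in range(N_ACTIONS):
                mask = a == act
                new_w[act] = np.linalg.lstsq(phi[mask], targets[mask], rcond=None)[0]
            w = new_w
        weights = w
    return weights

rng = np.random.default_rng(0)
x, a, r, x_next = collect_trajectory(2000, rng)
weights = fitted_policy_iteration(x, a, r, x_next)
print(greedy(weights, np.linspace(0.05, 0.95, 10)))
```

In this toy problem moving right always yields the larger reward, so the learned greedy policy would be expected to prefer action 1. The sketch says nothing about the sample-complexity bounds themselves, which depend on the mixing rate, smoothness, and function-set capacity named in the abstract.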

BibTeX key: antos2007a
Entry type: inproceedings
Booktitle: 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007)
Year: 2007
Pages: 330--337
Publisher: IEEE
pdf: papers/sapi_adprl4aa.pdf
date-modified: 2010-09-05 00:56:00 -0600
date-added: 2010-08-28 17:38:14 -0600
Note: (Honolulu, Hawaii, Apr 1--5, 2007.)
