Inproceedings,

A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation

R. Sutton, C. Szepesvári, and H. Maei.
NIPS, pages 1609--1616 (2008).

Abstract

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
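
The abstract describes GTD only in words, so a minimal sketch of one update step may help. It is written from the standard statement of the GTD update rules rather than from anything on this page, so treat it as an assumption and not as the authors' reference implementation. The names `gtd_step`, `dot`, `alpha`, `beta`, and `u` are illustrative. Here `u` is the running estimate of the expected TD(0) update vector E[delta * phi], and `theta` moves along a stochastic gradient of that vector's squared L2 norm. Every operation is a dot product or an element-wise pass, so the cost per step is linear in the number of parameters.

```python
def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def gtd_step(theta, u, phi, reward, phi_next, alpha, beta, gamma):
    """Apply one GTD update for a single sampled transition.

    theta    : value-function weights (list of floats)
    u        : running estimate of E[delta * phi] (list of floats)
    phi      : feature vector of the sampled state
    phi_next : feature vector of the successor state
    alpha    : step size for theta (assumed name)
    beta     : step size for u (assumed name)
    gamma    : discount factor
    """
    # TD(0) error for this transition under the current weights.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)

    # theta follows a stochastic gradient of ||E[delta * phi]||^2,
    # using the old u in place of the unknown expectation.
    phi_dot_u = dot(phi, u)
    new_theta = [
        t + alpha * (p - gamma * pn) * phi_dot_u
        for t, p, pn in zip(theta, phi, phi_next)
    ]

    # u tracks the expected TD(0) update vector by exponential averaging.
    new_u = [ui + beta * (delta * p - ui) for ui, p in zip(u, phi)]
    return new_theta, new_u


if __name__ == "__main__":
    # One transition between two one-hot states, purely for illustration.
    theta, u = gtd_step(
        theta=[0.0, 0.0], u=[1.0, 0.0],
        phi=[1.0, 0.0], reward=1.0, phi_next=[0.0, 1.0],
        alpha=0.1, beta=0.5, gamma=0.9,
    )
    print(theta, u)
```

The sketch applies no importance-sampling ratios, in line with the abstract's remark that GTD does not multiply by products of likelihood ratios. It assumes transitions are sampled i.i.d., as in the setting the abstract considers.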

BibTeX key: sutton2008
entry type: inproceedings
booktitle: NIPS
year: 2008
pages: 1609--1616
crossref: NIPS21
ee: http://books.nips.cc/papers/files/nips21/NIPS2008_0421.pdf
date-added: 2010-08-28 17:38:14 -0600
pdf: papers/gtdnips08.pdf
bibsource: DBLP, http://dblp.uni-trier.de
date-modified: 2010-11-25 00:50:58 -0700
