Reinforcement Learning Algorithms for MDPs -- A Survey
Cs. Szepesvári. TR09-13. Department of Computing Science, University of Alberta, (2009)
Abstract
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actor-critic methods.
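The bootstrapping idea the abstract opens with can be illustrated with a minimal tabular TD(0) sketch. This is not code from the survey; the toy random-walk chain, step size, and episode count are illustrative assumptions.

```python
import random

def td0_random_walk(num_episodes=5000, alpha=0.1, gamma=1.0, seed=0):
    """Tabular TD(0) value estimation on a toy 5-state random walk.

    Illustrative only: each update bootstraps on the current estimate
    of the successor state's value instead of waiting for the full return.
    """
    rng = random.Random(seed)
    n = 5                       # non-terminal states 1..5; 0 and 6 are terminal
    V = [0.0] * (n + 2)         # terminal states keep value 0
    for _ in range(num_episodes):
        s = (n + 1) // 2        # start in the middle state
        while 1 <= s <= n:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s2 == n + 1 else 0.0   # reward only on the right exit
            # TD(0) update: bootstrap on the current estimate V[s2]
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return V[1:n + 1]
```

Under this chain the true values are i/6 for state i, so the learned estimates should increase from left to right and the middle state should sit near 0.5.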
@techreport{szepesvari2009,
abstract = {This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actor-critic methods.},
author = {Szepesv{\'a}ri, {Cs}.},
institution = {Department of Computing Science, University of Alberta},
keywords = {active, actor-critic, approximation, bias-variance, difference, function, gradient, learning, least-squares, methods, Monte-Carlo, natural, online, optimization, overfitting, PAC-learning, planning, policy, Q-learning, reinforcement, simulation, stochastic, temporal, tradeoff, two-timescale},
number = {TR09-13},
pdf = {http://www.cs.ualberta.ca/system/files/tech_report/2009/TR09-13.pdf},
title = {Reinforcement Learning Algorithms for {MDP}s -- A Survey},
year = 2009
}