Online Markov Decision Processes under Bandit Feedback

G. Neu, A. György, C. Szepesvári, and A. Antos. IEEE Transactions on Automatic Control, 59 (3): 676--691 (December 2014)

Abstract

We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of $O(T^{2/3} \ln T)$. In this paper, assuming that stationary policies mix uniformly fast, we show that after $T$ time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is $O(T^{1/2} \ln T)$, giving the first rigorously proven, essentially tight regret bound for the problem.
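
As a sketch of the performance measure the abstract refers to, the expected regret against the best stationary policy in hindsight can be written as below. The notation ($\pi$, $x_t$, $a_t$, $r_t$, $\hat{R}_T$) is assumed here for illustration and is not taken from this record; the paper's own definitions may differ in detail.

```latex
% Assumed notation (illustrative only):
%   r_t        reward function chosen by the oblivious adversary at step t
%   (x_t, a_t) state-action pair visited by the learning agent at step t
%   (x_t^\pi, a_t^\pi)  state-action pair visited when following the
%                       stationary policy \pi from the start
\[
  \hat{R}_T
  \;=\;
  \max_{\pi}\,
  \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\bigl(x_t^{\pi}, a_t^{\pi}\bigr)\right]
  \;-\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\bigl(x_t, a_t\bigr)\right]
\]
% Under bandit feedback the agent observes only r_t(x_t, a_t),
% not r_t at other state-action pairs.
% Bounds stated in the abstract:
%   previous result:  \hat{R}_T = O(T^{2/3} \ln T)
%   this paper:       \hat{R}_T = O(T^{1/2} \ln T)
```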
