Article in conference proceedings

Reduced-Variance Payoff Estimation in Adversarial Bandit Problems

, and .
Proceedings of the ECML-2005 Workshop on Reinforcement Learning in Non-Stationary Environments, (2005)

Abstract

A natural way to compare learning methods in non-stationary environments is to compare their regret. In this paper we consider the regret of algorithms in adversarial multi-armed bandit problems. We propose several methods to improve the performance of the baseline exponentially weighted average forecaster by changing the payoff-estimation methods. We argue that improved performance can be achieved by constructing payoff estimation methods that produce estimates with low variance. Our arguments are backed up by both theoretical and empirical results. In fact, our empirical results show that significant performance gains are possible over the baseline algorithm.
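The baseline the abstract refers to is the exponentially weighted average forecaster for adversarial bandits (Exp3-style), whose standard importance-weighted payoff estimator is unbiased but has variance inversely proportional to the probability of the chosen arm. A minimal sketch of that baseline follows; the function name `exp3`, the payoff interface, and all parameter values are illustrative assumptions, not the paper's own implementation or its proposed reduced-variance estimators.

```python
import math
import random

def exp3(payoffs, K, T, gamma=0.1, rng=None):
    """Baseline exponentially weighted forecaster (Exp3-style sketch).

    payoffs(t, arm) -> reward in [0, 1]; K arms; T rounds.
    Uses the standard importance-weighted payoff estimate, whose
    variance scales with 1 / p[arm] -- the quantity the paper's
    methods aim to reduce.
    """
    rng = rng or random.Random(0)
    w = [1.0] * K                       # exponential weights
    total = 0.0
    for t in range(T):
        s = sum(w)
        # mix the weight distribution with uniform exploration
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]
        arm = rng.choices(range(K), weights=p)[0]
        x = payoffs(t, arm)
        total += x
        # importance-weighted estimate: unbiased, high variance when p[arm] is small
        xhat = x / p[arm]
        w[arm] *= math.exp(gamma * xhat / K)
    return total, w
```

On a toy instance where one arm pays 0.9 and the rest 0.1, the weight of the best arm quickly dominates, illustrating the baseline against which the reduced-variance estimators are compared.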

Tags

Users

  • @csaba

Comments and Reviews