Abstract
We study a novel multi-armed bandit problem that models the
challenge faced by a company wishing to explore new strategies to
maximize revenue whilst simultaneously maintaining their revenue
above a fixed baseline, uniformly over time. While previous work
addressed the problem under the weaker requirement of maintaining
the revenue constraint only at a given fixed time in the future, the
algorithms previously proposed are unsuitable due to their design
under the more stringent constraints. We consider both the
stochastic and the adversarial settings, where we propose, natural,
yet novel strategies and analyze the price for maintaining the
constraints. Amongst other things, we prove both high probability
and expectation bounds on the regret, while we also consider both
the problem of maintaining the constraints with high probability or
expectation. For the adversarial setting the price of maintaining
the constraint appears to be higher, at least for the algorithm
considered. A lower bound is given showing that the algorithm for
the stochastic setting is almost optimal. Empirical results
obtained in synthetic environments complement our theoretical
findings.
Users
Please
log in to take part in the discussion (add own reviews or comments).