Inproceedings

DCM Bandits: Learning to Rank with Multiple Clicks

Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, and Zheng Wen.
ICML, pages 1215--1224. (2016)

Abstract

Search engines recommend a list of web pages. The user examines this list, from the first page to the last, and may click on multiple attractive pages. This type of user behavior can be modeled by the dependent click model (DCM). In this work, we propose DCM bandits, an online learning variant of the DCM in which the objective is to maximize the probability of recommending a satisfactory item. The main challenge of our problem is that the learning agent does not observe the reward of the recommended list; it observes only the clicks, and this mismatch between the feedback and the reward is what makes our setting difficult. We propose a computationally efficient learning algorithm for our problem, which we call dcmKL-UCB; derive gap-dependent upper bounds on its regret under reasonable assumptions; and prove a matching lower bound up to logarithmic factors. We evaluate dcmKL-UCB on both synthetic and real-world problems. Our algorithm outperforms a range of baselines and performs well even when our modeling assumptions are violated. To the best of our knowledge, this is the first regret-optimal online learning algorithm for learning to rank with multiple clicks in a cascade-like model.
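To make the setup concrete, the sketch below simulates a DCM-style user and runs a KL-UCB-style ranker against it. This is a minimal illustration under assumptions made here, not the paper's implementation: the problem sizes, the attraction probabilities w, the per-position satisfaction probabilities v, and the simplified "learn from positions up to the last click" update are all hypothetical choices for this example.

import math
import random

def kl(p, q):
    """Bernoulli KL divergence d(p, q), clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(p_hat, n, t):
    """KL-UCB index: the largest q with n * kl(p_hat, q) <= log t,
    found by bisection (kl(p_hat, .) is increasing on [p_hat, 1])."""
    if n == 0:
        return 1.0
    budget = math.log(max(t, 2)) / n
    lo, hi = p_hat, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2.0
        if kl(p_hat, mid) > budget:
            hi = mid
        else:
            lo = mid
    return lo

random.seed(0)
L, K, T = 20, 4, 5000                     # items, list length, horizon
w = [0.9 * 0.7 ** i for i in range(L)]    # hypothetical attraction probs
v = [0.9, 0.8, 0.7, 0.6]                  # hypothetical per-position satisfaction probs

clicks = [0.0] * L                        # clicks observed per item
pulls = [0] * L                           # observations per item

for t in range(1, T + 1):
    # Rank all items by their KL-UCB index and recommend the top K.
    idx = [kl_ucb(clicks[i] / max(pulls[i], 1), pulls[i], t) for i in range(L)]
    ranked = sorted(range(L), key=lambda i: -idx[i])[:K]

    # DCM user: scan the list, click on attractive items, and stop
    # as soon as a click leaves the user satisfied.
    feedback = []
    for k, item in enumerate(ranked):
        clicked = random.random() < w[item]
        feedback.append(clicked)
        if clicked and random.random() < v[k]:
            break                         # satisfied; rest of list unexamined

    # Simplified update: treat positions up to the last click as examined
    # (the whole list if there was no click) and update their items.
    last = max((k for k, c in enumerate(feedback) if c), default=len(feedback) - 1)
    for k in range(last + 1):
        pulls[ranked[k]] += 1
        clicks[ranked[k]] += feedback[k]

print("most attractive items found:",
      sorted(range(L), key=lambda i: -clicks[i] / max(pulls[i], 1))[:K])

The bisection in kl_ucb relies on kl(p_hat, q) being increasing in q on [p_hat, 1], so the index, the largest optimistic estimate still consistent with the observations, is pinned down to high accuracy in about thirty halvings.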

Users

  • @brusilovsky
  • @csaba
  • @aho
