Abstract
We provide theoretical investigations into off-policy evaluation in
reinforcement learning using function approximators for (marginalized)
importance weights and value functions. Our contributions include: (1) A new
estimator, MWL, that directly estimates importance ratios over the state-action
distributions, removing the reliance on knowledge of the behavior policy as in
prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by
swapping the roles of importance weights and value-functions in MWL. MQL has an
intuitive interpretation of minimizing average Bellman errors and can be
combined with MWL in a doubly robust manner. (3) Several additional results
that offer further insights into these methods, including the sample complexity
analyses of MWL and MQL, their asymptotic optimality in the tabular setting,
how the learned importance weights depend on the choice of the discriminator
class, and how our methods provide a unified view of some old and new
algorithms in RL.
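
For concreteness, a plausible form of the two minimax objectives is sketched below in standard OPE notation; the symbols ($d^{\pi_b}$ for the behavior state-action distribution, $d_0$ for the initial-state distribution, $\gamma$ for the discount, and the function classes $\mathcal{W}, \mathcal{F}, \mathcal{Q}, \mathcal{G}$) are assumed here for illustration and may differ from the paper's exact notation. MWL learns a weight function $w \approx d^{\pi}/d^{\pi_b}$ against a discriminator class of candidate value functions:
\[
\hat{w} = \arg\min_{w \in \mathcal{W}} \max_{f \in \mathcal{F}} \Big( \mathbb{E}_{(s,a,s') \sim d^{\pi_b}}\big[ w(s,a)\,\big(\gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}[f(s',a')] - f(s,a)\big) \big] + (1-\gamma)\, \mathbb{E}_{s_0 \sim d_0,\, a_0 \sim \pi}\big[ f(s_0,a_0) \big] \Big)^2 .
\]
MQL swaps the roles, learning a Q-function against a discriminator class of candidate importance weights, which makes the "average Bellman error" interpretation explicit:
\[
\hat{q} = \arg\min_{q \in \mathcal{Q}} \max_{g \in \mathcal{G}} \Big( \mathbb{E}_{(s,a,r,s') \sim d^{\pi_b}}\big[ g(s,a)\,\big(r + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}[q(s',a')] - q(s,a)\big) \big] \Big)^2 ,
\]
with the corresponding value estimates $\hat{v}_{\mathrm{MWL}} = \mathbb{E}_{d^{\pi_b}}[\hat{w}(s,a)\, r]$ and $\hat{v}_{\mathrm{MQL}} = (1-\gamma)\, \mathbb{E}_{s_0 \sim d_0,\, a_0 \sim \pi}[\hat{q}(s_0,a_0)]$. Note that, in this sketch, both objectives require only samples drawn from the behavior distribution rather than the behavior policy's action probabilities, consistent with the claim that MWL removes the reliance on knowledge of the behavior policy.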