Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction.
Haruka Kiyohara, Masahiro Nomura, Yuta Saito
Published in: WWW (2024)
Keyphrases
- policy evaluation
- optimal policy
- least squares
- partially observable markov decision processes
- policy iteration
- reinforcement learning
- markov decision processes
- temporal difference
- markov decision problems
- variance reduction
- monte carlo
- model free
- semi parametric
- function approximation
- state space
- markov decision process
- finite state
- td learning
- dynamic programming
- decision problems
- dynamical systems
- markov chain
- random sampling
- belief state
- partially observable
- linear programming
- decision processes
- long run
- average cost
- infinite horizon
- sample size
- importance sampling
- active learning
- sample path
- machine learning