Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction.
Haruka Kiyohara, Masahiro Nomura, Yuta Saito
Published in: CoRR (2024)
Keyphrases
- policy evaluation
- optimal policy
- partially observable markov decision processes
- least squares
- policy iteration
- markov decision processes
- temporal difference
- reinforcement learning
- monte carlo
- model free
- variance reduction
- markov decision problems
- td learning
- semi parametric
- random sampling
- function approximation
- decision problems
- dynamic programming
- dynamical systems
- markov chain
- markov decision process
- average reward
- state space
- evaluation function
- long run
- reward function
- reinforcement learning algorithms
- initial state
- lost sales