Minimax Off-Policy Evaluation for Multi-Armed Bandits.
Cong Ma, Banghua Zhu, Jiantao Jiao, Martin J. Wainwright
Published in: IEEE Trans. Inf. Theory (2022)
Keyphrases
- multi armed bandits
- policy evaluation
- least squares
- temporal difference
- reinforcement learning
- evaluation function
- monte carlo
- model free
- policy iteration
- bandit problems
- multi armed bandit
- markov decision processes
- variance reduction
- markov decision problems
- function approximation
- statistical inference
- semi parametric
- optimal policy
- worst case
- reinforcement learning algorithms
- upper bound
- sample size
- partially observable markov decision processes
- decision processes
- action selection
- state space
- decision theoretic
- machine learning