Minimax Off-Policy Evaluation for Multi-Armed Bandits.
Cong Ma, Banghua Zhu, Jiantao Jiao, Martin J. Wainwright
Published in: CoRR (2021)
Keyphrases
- multi-armed bandits
- policy evaluation
- least squares
- temporal difference
- reinforcement learning
- monte carlo
- evaluation function
- model-free
- multi-armed bandit
- policy iteration
- bandit problems
- markov decision processes
- variance reduction
- markov decision problems
- function approximation
- semi-parametric
- optimal policy
- partially observable markov decision processes
- reinforcement learning algorithms
- linear programming
- linear regression
- statistical inference
- support vector machine
- machine learning
- decision making