Sublinear Optimal Policy Value Estimation in Contextual Bandits.
Weihao Kong, Gregory Valiant, Emma Brunskill
Published in: CoRR (2019)
Keyphrases
- optimal policy
- markov decision processes
- decision problems
- finite horizon
- state space
- state dependent
- multistage
- long run
- reinforcement learning
- dynamic programming
- infinite horizon
- sufficient conditions
- finite state
- asymptotically optimal
- average reward
- bayesian reinforcement learning
- lost sales
- serial inventory systems
- markov decision process
- average cost
- policy iteration
- control policies
- reward function
- learning algorithm