Sublinear Optimal Policy Value Estimation in Contextual Bandits.
Weihao Kong, Emma Brunskill, Gregory Valiant
Published in: AISTATS (2020)
Keyphrases
- optimal policy
- Markov decision processes
- finite horizon
- reinforcement learning
- dynamic programming
- state space
- decision problems
- infinite horizon
- long run
- state dependent
- finite state
- sufficient conditions
- multistage
- average cost
- reward function
- Markov decision process
- control policies
- average reward
- lost sales
- asymptotically optimal
- Bayesian reinforcement learning
- inventory models
- policy iteration
- serial inventory systems
- inventory level
- stochastic demand