Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP.
Yuanhao Wang, Kefan Dong, Xiaoyu Chen, Liwei Wang. Published in: ICLR (2020)
Keyphrases
- infinite horizon
- optimal policy
- markov decision processes
- state space
- policy iteration
- finite horizon
- reinforcement learning
- dynamic programming
- partially observable
- stochastic demand
- long run
- decision problems
- markov decision problems
- finite state
- average cost
- average reward
- reward function
- discount factor
- exploration strategy
- reinforcement learning algorithms
- learning algorithm
- action selection
- single item
- multi agent
- initial state
- optimal control
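As an illustration of the keyphrases above (Q-learning, exploration strategy, action selection, discount factor), the following is a minimal sketch of tabular Q-learning with a count-based UCB exploration bonus on a hypothetical two-state discounted MDP. This is not the paper's exact algorithm — the paper's bonus and learning-rate schedule are specifically tuned for sample-efficiency guarantees in the infinite-horizon setting — just a generic sketch of the UCB-style action-selection idea.

```python
import math
import random

# Hypothetical toy MDP (not from the paper): P[s][a] = list of (prob, next_state, reward).
# State 1 is where reward lives; state 0 must take action 1 to reach it.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 0.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.5)]},
}

def step(s, a, rng):
    """Sample a transition from the toy MDP."""
    r, acc = rng.random(), 0.0
    for p, s2, rew in P[s][a]:
        acc += p
        if r <= acc:
            return s2, rew
    return P[s][a][-1][1], P[s][a][-1][2]

def q_learning_ucb(gamma=0.9, steps=50000, c=1.0, seed=0):
    rng = random.Random(seed)
    # Optimistic initialization at the max possible value 1/(1 - gamma).
    Q = {(s, a): 1.0 / (1 - gamma) for s in P for a in P[s]}
    N = {(s, a): 0 for s in P for a in P[s]}  # visit counts per state-action pair
    s = 0
    for t in range(1, steps + 1):
        # UCB action selection: value estimate plus a count-based bonus,
        # so rarely tried actions get picked even if their estimate is low.
        a = max(P[s], key=lambda b: Q[(s, b)]
                + c * math.sqrt(math.log(t + 1) / (N[(s, b)] + 1)))
        s2, rew = step(s, a, rng)
        N[(s, a)] += 1
        alpha = 1.0 / N[(s, a)]  # simple decaying learning rate (averaging)
        target = rew + gamma * max(Q[(s2, b)] for b in P[s2])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q, N

Q, N = q_learning_ucb()
```

With the UCB bonus, every action is tried despite the greedy value estimates, and the learned Q-values at state 0 end up favoring action 1, the only route to the rewarding state.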