Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP.
Kefan Dong, Yuanhao Wang, Xiaoyu Chen, Liwei Wang
Published in: CoRR (2019)
Keyphrases
- infinite horizon
- optimal policy
- Markov decision process
- finite horizon
- state space
- policy iteration
- reinforcement learning
- dynamic programming
- long run
- discount factor
- partially observable
- average cost
- action selection
- optimal control
- stochastic demand
- single item
- decision problems
- finite state
- average reward
- function approximation
- learning algorithm
- reward function
- production planning
- multistage
- action space
- initial state
- model free
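To make the titled technique concrete, here is a minimal sketch of tabular Q-learning with a count-based UCB exploration bonus on a discounted infinite-horizon MDP. This is an illustration only, not the paper's exact algorithm or analysis: the toy 2-state MDP, the bonus scale `c`, the `1/n` learning rate, and the optimistic initialization at `1/(1-gamma)` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration;
# P[s, a] holds next-state probabilities, R[s, a] the mean reward in [0, 1].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[0.0, 0.1],
              [0.5, 1.0]])
gamma = 0.9          # discount factor
c = 1.0              # UCB bonus scale (an assumed tunable constant)
T = 20000            # number of interaction steps

rng = np.random.default_rng(0)
nS, nA = R.shape
Q = np.full((nS, nA), 1.0 / (1.0 - gamma))  # optimistic init at V_max
N = np.zeros((nS, nA))                      # state-action visit counts

s = 0
for t in range(T):
    # Act greedily w.r.t. Q plus a count-based UCB exploration bonus.
    bonus = c * np.sqrt(np.log(t + 2) / np.maximum(N[s], 1))
    a = int(np.argmax(Q[s] + bonus))
    N[s, a] += 1
    s_next = rng.choice(nS, p=P[s, a])
    r = R[s, a]
    alpha = 1.0 / N[s, a]                   # 1/n learning rate (assumption)
    # Standard model-free Q-learning update toward the bootstrapped target.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))
```

Because rewards lie in [0, 1] and the table starts at the optimistic value 1/(1-gamma), every Q-value stays bounded by 1/(1-gamma) throughout learning, and under-visited pairs retain high values, which is what drives the exploration.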