Discounted UCB1-tuned for Q-learning.
Koki Saito, Akira Notsu, Katsuhiro Honda
Published in: SCIS&ISIS (2014)
Keyphrases
- optimal policy
- Markov decision processes
- reinforcement learning
- state space
- discounted reward
- multi agent
- function approximation
- decision problems
- cooperative
- infinite horizon
- dynamic programming
- reinforcement learning algorithms
- bandit problems
- finite horizon
- policy iteration
- average reward
- Markov decision process
- model free
- multi armed bandit
- action selection
- cash flow
- stochastic approximation
- learning algorithm
- reward function
- finite state
- average cost
- credit assignment
- state action
- sufficient conditions
- partially observable Markov decision processes
- long run
- linear programming