Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs.
Mohammad Sadegh Talebi, Odalric-Ambrym Maillard
Published in: ALT (2018)
Keyphrases
- markov decision processes
- reinforcement learning
- multi armed bandit
- regret bounds
- markov decision problems
- policy iteration
- average reward
- optimal policy
- state space
- reinforcement learning algorithms
- markov decision process
- state and action spaces
- function approximation
- infinite horizon
- finite state
- model free
- partially observable
- machine learning
- dynamic programming
- stochastic games
- online learning
- lower bound
- action space
- learning algorithm
- average cost
- temporal difference
- linear regression
- reward function
- least squares
- learning process
- continuous state and action spaces