Online Regret Bounds for Undiscounted Continuous Reinforcement Learning.
Ronald Ortner, Daniil Ryabko
Published in: NIPS (2012)
Keyphrases
- reinforcement learning
- Markov decision processes
- online learning
- regret bounds
- action space
- online convex optimization
- multi-armed bandit
- policy iteration
- state space
- special case
- Markov decision process
- function approximation
- temporal difference
- optimal control
- Markov decision problems
- optimal policy
- upper bound
- probabilistic model