Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits.
Hassan Saber, Fabien Pesquerel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
Published in: ACML (2023)
Keyphrases
- regret bounds
- markov decision processes
- multi armed bandits
- lower bound
- online learning
- reinforcement learning
- worst case
- linear regression
- expert advice
- multi armed bandit
- upper bound
- state space
- dynamical systems
- reward function
- bandit problems
- dynamic model
- multi armed bandit problems
- factored mdps
- policy iteration
- markov decision process
- finite state
- optimal policy
- finite horizon
- decision diagrams
- decision theoretic planning
- average cost
- planning problems
- online convex optimization