Optimistic posterior sampling for reinforcement learning: worst-case regret bounds.
Shipra Agrawal, Randy Jia. Published in: NIPS (2017)
Keyphrases
- reinforcement learning
- multi-armed bandit
- worst case
- regret bounds
- lower bound
- upper bound
- sample size
- state space
- Markov chain Monte Carlo
- probability distribution
- learning algorithm
- optimal policy
- model-free
- random sampling
- learning process
- probabilistic model
- posterior distribution
- NP-hard
- model selection
- generative model
- online learning
- Markov decision processes
- bayesian framework
- machine learning