Online learning in MDPs with linear function approximation and bandit feedback.
Gergely Neu, Julia Olkhovskaya
Published in: NeurIPS (2021)
Keyphrases
- function approximation
- online learning
- reinforcement learning
- temporal difference learning algorithms
- function approximators
- regret bounds
- Markov decision processes
- temporal difference
- state space
- model free
- policy evaluation
- temporal difference learning
- Markov decision problems
- learning tasks
- radial basis function
- e-learning
- reinforcement learning problems
- policy iteration
- dynamic programming
- optimal policy
- multi agent
- policy search
- machine learning
- data mining
- linear programming
- supervised learning
- training data
- learning algorithm