Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL.
Miguel Suau, Matthijs T. J. Spaan, Frans A. Oliehoek
Published in: CoRR (2023)
Keyphrases
- optimal policy
- reinforcement learning
- markov decision process
- action selection
- markov decision processes
- control policy
- policy search
- action space
- actor critic
- trajectory data
- reinforcement learning problems
- state space
- rl algorithms
- partially observable domains
- policy iteration
- infinite horizon
- dynamic programming
- autonomous learning
- policy gradient
- neural network
- reward function
- model free
- average reward
- control policies
- decision problems
- learning process
- dynamic environments
- policy evaluation
- reinforcement learning algorithms
- reinforcement learning methods
- finite state
- long run
- partially observable