A Batch, Off-Policy, Actor-Critic Algorithm for Optimizing the Average Reward.
Susan A. Murphy, Yanzhen Deng, Eric B. Laber, Hamid Reza Maei, Richard S. Sutton, Katie Witkiewitz
Published in: CoRR (2016)
Keyphrases
- average reward
- actor critic
- dynamic programming
- gradient method
- optimal policy
- Markov decision processes
- policy gradient
- cost function
- optimal solution
- linear programming
- NP-hard
- learning algorithm
- computational complexity
- objective function
- convergence rate
- long run
- temporal difference
- reinforcement learning algorithms
- policy iteration