Gradient temporal-difference learning for off-policy evaluation using emphatic weightings.
Jiaqing Cao, Quan Liu, Fei Zhu, Qiming Fu, Shan Zhong
Published in: Inf. Sci. (2021)
Keyphrases
- temporal difference learning
- policy evaluation
- temporal difference
- policy iteration
- function approximation
- reinforcement learning
- model free
- evaluation function
- markov decision processes
- monte carlo
- reinforcement learning algorithms
- policy gradient
- fixed point
- step size
- action selection
- markov decision process
- radial basis function
- optimal control
- supervised learning
- state space
- neural network
- infinite horizon