Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning.
Hanhan Zhou, Tian Lan, Vaneet Aggarwal
Published in: CoRR (2023)
Keyphrases
- average reward
- policy evaluation
- policy iteration
- reinforcement learning
- Markov decision processes
- model free
- optimal policy
- variance reduction
- policy gradient
- temporal difference
- state space
- least squares
- Monte Carlo
- function approximation
- importance sampling
- dynamic programming
- sample size
- particle filter
- search space
- learning algorithm