Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs.

Zihan Zhou, Honghao Wei, Lei Ying

Published in: CoRR (2023)

Keyphrases

model free
total reward
average reward
reinforcement learning algorithms
reinforcement learning
online learning
policy iteration
temporal difference
function approximation
worst case
optimal policy
Markov decision processes
policy evaluation
control policy
online algorithms
dynamic programming
lower bound
long run
pattern recognition
neural network
online convex optimization
regret bounds
finite horizon
state space
optimal solution
training data