Exploration in Reward Machines with Low Regret

Hippolyte Bourel, Anders Jonsson, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi

Published in: AISTATS (2023)

Keyphrases

bandit problems
reinforcement learning
lower bound
online learning
high levels
expected reward
neural network
worst case
reward function
support vector
decision problems
long run
weighted majority
minimax regret