Delayed Reward Bernoulli Bandits: Optimal Policy and Predictive Meta-Algorithm PARDI.
Sebastian Pilarski
Slawomir Pilarski
Dániel Varró
Published in:
IEEE Trans. Artif. Intell. (2022)
Keyphrases
optimal policy
dynamic programming
learning algorithm
reinforcement learning
average reward
long run
expected reward
NP-hard
Markov chain
multi-armed bandit
dynamic programming algorithms
convergence rate
graphical models
upper bound
state space
cost function
search algorithm
optimal solution