Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback.
Tal Lancewicki, Aviv Rosenberg, Dmitry Sotnikov
Published in: ICML (2023)
Keyphrases
- optimal policy
- reward function
- Markov decision process
- multi-armed bandit problems
- total reward
- Markov decision processes
- reinforcement learning
- bandit problems
- expected reward
- average reward
- online learning
- infinite horizon
- optimization algorithm
- optimization problems
- multi-agent
- partially observable
- policy iteration
- regret bounds
- Markov decision problems
- neural network
- lower bound
- game theory
- relevance feedback
- evolutionary algorithm
- multiple agents
- minimax regret
- state space
- objective function
- upper confidence bound
- worst case