Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback.
Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg
Published in: NeurIPS (2022)
Keyphrases
- bandit problems
- regret bounds
- multi-armed bandits
- upper confidence bound
- multi-armed bandit
- Markov decision processes
- reward function
- multi-armed bandit problems
- lower bound
- optimal policy
- state space
- online learning
- Markov decision process
- finite state
- reinforcement learning
- worst case
- loss function
- multi-agent
- expert advice
- decision problems
- utility function
- Markov chain
- upper bound
- confidence bounds
- weighted majority
- total reward
- dynamic programming algorithms
- objective function
- relevance feedback
- minimax regret
- linear programming
- planning under uncertainty
- policy iteration
- user feedback