Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback.
Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg
Published in: CoRR (2022)
Keyphrases
- bandit problems
- upper confidence bound
- multi armed bandits
- markov decision processes
- regret bounds
- reward function
- multi armed bandit problems
- state space
- multi armed bandit
- online learning
- lower bound
- relevance feedback
- decision problems
- markov decision process
- optimal policy
- multi agent
- utility function
- loss function
- contextual bandit
- computational complexity
- reinforcement learning
- partially observable
- worst case
- transition probabilities
- learning algorithm
- expert advice