Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback.
Tal Lancewicki, Aviv Rosenberg, Dmitry Sotnikov
Published in: CoRR (2023)
Keyphrases
- reward function
- optimal policy
- Markov decision process
- multi-armed bandit problems
- total reward
- Markov decision processes
- reinforcement learning
- bandit problems
- optimization algorithm
- expected reward
- state space
- inverse reinforcement learning
- upper confidence bound
- Markov decision problems
- optimization problems
- linear programming
- policy search
- lower bound
- utility function
- finite state
- regret bounds
- utility elicitation
- state and action spaces
- partially observable
- finite horizon
- dynamic programming
- reinforcement learning algorithms
- infinite horizon
- random sampling
- partially observable Markov decision processes
- average cost
- fluid model
- multi-agent
- long run
- decision problems
- online learning
- neural network