Deterministic MDPs with Adversarial Rewards and Bandit Feedback.
Raman Arora, Ofer Dekel, Ambuj Tewari
Published in: UAI (2012)
Keyphrases
- markov decision processes
- reinforcement learning
- bandit problems
- fully observable
- dynamic programming
- state space
- reward function
- partially observable
- multi armed bandits
- finite state
- markov decision problems
- multi agent
- optimal policy
- state and action spaces
- factored mdps
- policy iteration
- planning problems
- linear programming
- markov chain
- action space
- planning under uncertainty
- finite horizon
- decision theoretic planning
- infinite horizon
- random sampling
- average reward
- partial observability
- partially observable markov decision processes
- sufficient conditions
- real time dynamic programming
- decision problems