Deterministic MDPs with Adversarial Rewards and Bandit Feedback
Raman Arora, Ofer Dekel, Ambuj Tewari
Published in: CoRR (2012)
Keyphrases
- markov decision processes
- reinforcement learning
- bandit problems
- fully observable
- optimal policy
- state space
- reward function
- multi armed bandits
- finite state
- planning problems
- decision problems
- factored mdps
- planning under uncertainty
- policy iteration
- partially observable
- multi agent
- random sampling
- stationary policies
- reinforcement learning algorithms
- dynamic programming
- markov decision problems
- finite horizon
- action space
- discounted reward
- decision theoretic planning
- markov chain
- semi markov decision processes
- learning algorithm
- partial observability
- multi armed bandit
- markov decision process
- user feedback
- heuristic search
- average reward
- average cost
- model free
- decision diagrams
- infinite horizon
- relevance feedback
- upper bound
- lower bound
- optimal solution
- real time dynamic programming
- machine learning
- stochastic shortest path