Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback.
Yiliu Wang, Wei Chen, Milan Vojnovic
Published in: ICLR (2024)
Keyphrases
- reward function
- inverse reinforcement learning
- Markov decision processes
- reinforcement learning
- state space
- reinforcement learning algorithms
- optimal policy
- transition model
- multiple agents
- initially unknown
- hierarchical reinforcement learning
- minimax regret
- transition probabilities
- state variables
- Markov decision process
- maximum entropy
- generative model
- Markov chain
- probabilistic model
- decision making