Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback.
Yiliu Wang, Wei Chen, Milan Vojnovic
Published in: CoRR (2023)
Keyphrases
- reward function
- inverse reinforcement learning
- state space
- reinforcement learning
- Markov decision processes
- optimal policy
- reinforcement learning algorithms
- multiple agents
- initially unknown
- hierarchical reinforcement learning
- transition probabilities
- generative model
- Markov decision process
- transition model
- small number of iterations
- machine learning
- function approximation
- information extraction
- dynamic programming
- learning algorithm