Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback.
Asaf Cassel, Haipeng Luo, Aviv Rosenberg, Dmitry Sotnikov
Published in: CoRR (2024)
Keyphrases
- regret bounds
- Markov decision processes
- bandit problems
- reinforcement learning
- online learning
- upper confidence bound
- multi-armed bandit
- lower bound
- state space
- multi-armed bandit problems
- user feedback
- random sampling
- reward function
- linear regression
- expert advice
- minimax regret
- worst case
- linear systems
- Markov chain
- confidence bounds
- least squares
- relevance feedback