Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems.
Xiang Ji
Huazheng Wang
Minshuo Chen
Tuo Zhao
Mengdi Wang
Published in:
CoRR (2023)
Keyphrases
bandit problems
learning algorithm
learning process
decision making
human experts
machine learning
online learning
learning systems
action selection
multi agent
optimal solution
active learning
special case
language acquisition
human learning
preference learning