COPR: Continual Human Preference Learning via Optimal Policy Regularization.
Han Zhang, Lin Gui, Yu Lei, Yuanzhao Zhai, Yehong Zhang, Yulan He, Hui Wang, Yue Yu, Kam-Fai Wong, Bin Liang, Ruifeng Xu
Published in: CoRR (2024)
Keyphrases
- optimal policy
- preference learning
- Markov decision processes
- finite horizon
- reinforcement learning
- decision problems
- infinite horizon
- dynamic programming
- state dependent
- state space
- long run
- ordinal regression
- Gaussian processes
- multistage
- pairwise comparison
- active learning
- lost sales
- sufficient conditions
- recommender systems
- linear programming
- support vector machine
- lower bound