Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback.
Wei Shen, Rui Zheng, WenYu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Published in: EMNLP (Findings) (2023)
Keyphrases
- reinforcement learning
- human operators
- markov decision processes
- state space
- human interaction
- neural network
- total length
- reinforcement learning algorithms
- learning process
- human faces
- user engagement
- action selection
- data transmission
- human behavior
- mathematical model
- optimal policy
- active learning
- multi agent
- learning algorithm
- machine learning