REBEL: A Regularization-Based Solution for Reward Overoptimization in Reinforcement Learning from Human Feedback.
Souradip Chakraborty, Amisha Bhaskar, Anukriti Singh, Pratap Tokekar, Dinesh Manocha, Amrit Singh Bedi
Published in: CoRR (2023)
Keyphrases
- reinforcement learning
- function approximation
- Markov decision processes
- control policy
- model free
- machine learning
- reward function
- state space
- human subjects
- learning problems
- mixed norm
- motor skills
- temporal difference
- regularization parameter
- closed form
- image restoration
- dynamic programming
- learning process
- optimal solution
- learning algorithm