Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking.

Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, Dj Dvijotham, Adam Fisch, Katherine A. Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant

Published in: CoRR (2023)

Keyphrases