Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification.

Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
Published in: CoRR (2024)
Keyphrases