Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification.
Thomas Kwa
Drake Thomas
Adrià Garriga-Alonso
Published in:
CoRR (2024)
Keyphrases
kl divergence
heavy tailed
kullback leibler divergence
generalized gaussian
information theoretic
gaussian distribution
probability density function
distance measure
gaussian mixture
posterior distribution
probability density
mutual information
information theory
denoising
marginal distributions
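Several of the keyphrases above (kl divergence, gaussian distribution, probability density function) concern the Kullback-Leibler divergence between continuous distributions. As a purely illustrative sketch, not the paper's method, the snippet below computes the standard closed-form KL divergence between two univariate Gaussians and checks it against a direct numerical integration of the defining integral; the function names are hypothetical.

```python
import math

def kl_gaussians(mu1, s1, mu2, s2):
    # Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2)) for univariate Gaussians:
    # log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, n=200_000):
    # Numerical check: midpoint Riemann sum of p(x) * log(p(x)/q(x)).
    def pdf(x, mu, s):
        return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p, q = pdf(x, mu1, s1), pdf(x, mu2, s2)
        if p > 0.0:
            total += p * math.log(p / q) * dx
    return total

print(kl_gaussians(0, 1, 1, 2))  # ≈ 0.4431
print(kl_numeric(0, 1, 1, 2))    # should agree to several decimal places
```

The agreement between the two values is a quick sanity check that the closed form was transcribed correctly; for heavy-tailed densities (another keyphrase above), the same integral can diverge, which is related to why KL-based regularization behaves differently there.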