Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment.

Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe
Published in: CoRR (2024)