Feedback Loops With Language Models Drive In-Context Reward Hacking.

Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
Published in: CoRR (2024)