Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.

Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
Published in: CoRR (2024)
Keyphrases