Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam S. Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul F. Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
Published in: CoRR (2024)