Discovering Language Model Behaviors with Model-Written Evaluations.
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan
Published in: CoRR (2022)