Discovering Language Model Behaviors with Model-Written Evaluations.

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan
Published in: ACL (Findings) (2023)
Keyphrases