Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark
Published in: CoRR (2022)