Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.
Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Published in: CoRR (2024)