Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

Published in: CoRR (2024)

Keyphrases