Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game.
Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
Published in: ICLR (2024)