Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment.

Haoran Wang, Kai Shu
Published in: CoRR (2023)
Keyphrases