How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States.
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Yongbin Li
Published in:
CoRR (2024)
Keyphrases
hidden states
hidden markov models
hidden variables
conditional random fields
exponential family
hidden state
generative model
graphical models
markov model
dynamic bayesian networks
machine learning
similarity measure
maximum entropy