Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas
Published in: CoRR (2023)