Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders.

Luke Marks, Amir Abdullah, Luna Mendez, Rauno Arike, Philip H. S. Torr, Fazl Barez
Published in: CoRR (2023)
Keyphrases