Publication: Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders.