Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.

Published in: CoRR (2024)

Keyphrases