Tom Lieberum

Publication Activity (10 Years)

Years Active: 2021-2024
Publications (10 Years): 9

Top Topics

Sparse Reconstruction

Multiple Choice

Sparse Representation

Top Venues

NeurIPS (Competition and Demos)

Publications

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. CoRR (2024)

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
Improving Dictionary Learning with Gated Sparse Autoencoders. CoRR (2024)

Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Grégoire Delétang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca D. Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane
Evaluating Frontier Models for Dangerous Capabilities. CoRR (2024)

János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
AtP*: An efficient and scalable method for localizing LLM behaviour to components. CoRR (2024)

Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. CoRR (2023)

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Progress measures for grokking via mechanistic interpretability. ICLR (2023)

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Progress measures for grokking via mechanistic interpretability. CoRR (2023)

Rohin Shah, Steven H. Wang, Cody Wild, Stephanie Milani, Anssi Kanervisto, Vinicius G. Goecks, Nicholas R. Waytowich, David Watkins-Valls, Bharat Prakash, Edmund Mills, Divyansh Garg, Alexander Fries, Alexandra Souly, Jun Shern Chan, Daniel del Castillo, Tom Lieberum
Retrospective on the 2021 BASALT Competition on Learning from Human Feedback. CoRR (2022)

Rohin Shah, Steven H. Wang, Cody Wild, Stephanie Milani, Anssi Kanervisto, Vinicius G. Goecks, Nicholas R. Waytowich, David Watkins-Valls, Bharat Prakash, Edmund Mills, Divyansh Garg, Alexander Fries, Alexandra Souly, Jun Shern Chan, Daniel del Castillo, Tom Lieberum
Retrospective on the 2021 MineRL BASALT Competition on Learning from Human Feedback. NeurIPS (Competition and Demos) (2021)