Alexander Pan
Publication Activity (10 Years)
Years Active: 2021-2024
Publications (10 Years): 7
Top Topics
Long Term And Short Term
Power System
Feedback Loops
Multiarmed Bandit
Top Venues
CoRR
ICLR
ICML
Publications
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
Feedback Loops With Language Models Drive In-Context Reward Hacking.
CoRR
(2024)
Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Hanlin Zhang, Scott Emmons, Dan Hendrycks
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
ICML
(2023)
Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
CoRR
(2023)
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
Representation Engineering: A Top-Down Approach to AI Transparency.
CoRR
(2023)
Alexander Pan, Kush Bhatia, Jacob Steinhardt
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models.
ICLR
(2022)
Alexander Pan, Kush Bhatia, Jacob Steinhardt
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models.
CoRR
(2022)
Alexander Pan, Yongkyun Lee, Huan Zhang, Yize Chen, Yuanyuan Shi
Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training.
CoRR
(2021)