Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
Published in: CoRR (2023)