Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark.
Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Hanlin Zhang, Scott Emmons, Dan Hendrycks
Published in: ICML (2023)