Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
Published in: CoRR (2023)