ODIN: Disentangled Reward Mitigates Hacking in RLHF.
Lichang Chen
Chen Zhu
Davit Soselia
Jiuhai Chen
Tianyi Zhou
Tom Goldstein
Heng Huang
Mohammad Shoeybi
Bryan Catanzaro
Published in: CoRR (2024)
Keyphrases
reinforcement learning
intelligence and security informatics
security threats
penetration testing
data sets
case study
average reward
partially observable environments
real time
artificial intelligence
image processing
database systems
expert systems
information technology
dynamic programming
long run