TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback.
Eunseop YoonHee Suk YoonSooHwan EomGunsoo HanDaniel Wontae NamDaejin JoKyoung-Woon OnMark A. Hasegawa-JohnsonSungwoong KimChang D. YooPublished in: CoRR (2024)
Keyphrases
- fine grained
- reinforcement learning
- coarse grained
- access control
- function approximation
- eligibility traces
- tightly coupled
- action space
- state space
- massively parallel
- learning algorithm
- temporal difference
- markov decision processes
- optimal policy
- dynamic programming
- machine learning
- agent receives
- data lineage
- reward signal
- user feedback