TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback.
Eunseop YoonHee Suk YoonSooHwan EomGunsoo HanDaniel Wontae NamDaejin JoKyoung-Woon OnMark Hasegawa-JohnsonSungwoong KimChang Dong YooPublished in: ACL (Findings) (2024)
Keyphrases
- fine grained
- reinforcement learning
- coarse grained
- access control
- action space
- tightly coupled
- state space
- machine learning
- eligibility traces
- function approximation
- partially observable environments
- databases
- reinforcement learning algorithms
- optimal policy
- learning algorithm
- dynamic programming
- massively parallel
- policy gradient
- model free
- knowledge base
- data lineage
- average reward
- web pages
- partially observable markov decision processes
- long run
- markov decision processes
- transfer learning
- co occurrence