Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment.
Jiaxiang Li, Siliang Zeng, Hoi-To Wai, Chenliang Li, Alfredo García, Mingyi Hong
Published in: CoRR (2024)
Keyphrases
- data analysis
- human experts
- training data
- reinforcement learning
- data structure
- data processing
- human subjects
- prior knowledge
- statistical analysis
- synthetic data
- data sources
- high quality
- learned models
- data quality
- knowledge acquisition
- experimental data
- domain experts
- learning models
- missing data
- data sets
- online learning
- image data
- learning algorithm
- data collection
- learning systems
- supervised learning
- data distribution
- raw data
- data points
- xml documents
- pairwise
- hidden variables