Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs.

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia
Published in: CoRR (2024)
Keyphrases