Multi-turn Reinforcement Learning from Preference Human Feedback.
Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
Published in: CoRR (2024)