Putting RL back in RLHF
We are excited to introduce the RLOO (REINFORCE Leave-One-Out) Trainer in TRL. As an alternative to PPO, RLOO is a new online RLHF training algorithm designed to be more accessible and easier to implement. In particular, RLOO requires less GPU memory and takes less wall-clock time to converge, as shown in the figures below:

🤑 RLOO uses approximately 50-70% less VRAM than PPO, depending on the model size
🚀 RLOO runs 2x faster than PPO with 1B models and up to 3x faster than PPO with 6.9B models.
🔥 RLOO performs competitively with PPO in terms of the response win rate (judged by GPT-4) and consistently outperforms popular offline methods like DPO.
With RLOO, we bring Reinforcement Learning back into RLHF, enabling the community to explore online RL methods more easily. This is exciting because more and more studies have shown that online RL is more effective than offline methods such as DPO (https://arxiv.org/abs/2402.04792, https://arxiv.org/abs/2405.08448).
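To make the "leave one-out" part of the name concrete, here is a minimal sketch (not the TRL implementation; the helper name `rloo_advantages` is purely illustrative) of the core idea: for each prompt we sample k completions, and each completion's baseline is the average reward of the other k - 1 completions for the same prompt, so no separate value network is needed.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE Leave-One-Out advantages (illustrative sketch).

    rewards: tensor of shape (num_prompts, k) with the scalar reward of each
    of the k completions sampled for the same prompt. Each completion's
    baseline is the mean reward of the remaining k - 1 completions.
    """
    _, k = rewards.shape
    # Mean of the other k - 1 samples: (row sum minus own reward) / (k - 1)
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Example: 2 prompts, k = 4 sampled completions each
rewards = torch.tensor([[1.0, 0.5, 0.0, 0.5],
                        [2.0, 1.0, 1.0, 0.0]])
print(rloo_advantages(rewards))
```

Because the baseline comes directly from the other sampled completions, the per-prompt advantages sum to zero and the variance reduction is obtained "for free" from samples that are generated anyway during online training.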