The N Implementation Details of RLHF with PPO
RLHF / ChatGPT has been a popular research topic recently. As part of our effort to study RLHF more deeply, this blog post attempts to reproduce OpenAI’s original 2019 RLHF codebase at openai/lm-human-preferences. Despite its “tensorflow-1.x-ness,” OpenAI’s original codebase is well evaluated and benchmarked, making it a good place to study RLHF implementation and engineering details.

We aim to:

reproduce OAI’s results on stylistic tasks and match the learning curves of openai/lm-human-preferences;
present a checklist of implementation details, in the spirit of The 37 Implementation Details of Proximal Policy Optimization and Debugging RL, Without the Agonizing Pain;
provide a simple-to-read and minimal reference implementation of RLHF.