Learning to summarize from human feedback

Original paper · Stiennon et al., 2020

The people at OpenAI return with a follow-up to a paper covered in my previous blog post, "Fine-Tuning Language Models from Human Preferences", just one year after the original. Last time we noted that the models fine-tuned from human preferences were successful on a stylistic continuation task but had more trouble with the summarization tasks, essentially learning to become "smart copiers". This paper focuses solely on summarization, applied primarily to the TL;DR dataset, but the authors also show that their results transfer to the news domain without any domain-specific fine-tuning, demonstrated on CNN/Daily Mail news articles. For the most part the methodology behind this paper is similar to that of its predecessor, so I will skip most of those details and instead focus on what was new this time around.

The authors make an interesting statement in the paper's opening lines: current language models applied to specific tasks are often fine-tuned using supervised learning, maximizing the log probability of a set of human demonstrations. This approach leads to a misalignment between the fine-tuning objective - maximizing the likelihood of human-written text - and what we're actually after - generating high-quality outputs as judged by humans. The misalignment arises because the maximum likelihood objective makes no distinction between important errors (e.g. making up facts) and unimportant ones (e.g. picking a slightly less fitting synonym). Now that we've once again motivated the work, let's jump into what was actually done in this paper.

Dataset

For this project the authors train on the TL;DR summarization dataset from Reddit, a collection of posts from various subreddits together with summaries (TL;DRs) written by the original poster. This is a nice dataset because it covers a wide variety of topics, and with roughly 3 million posts in total it can be filtered for quality and topic representation and still end up at around 120k posts. Additionally, the authors favour TL;DR over the CNN/Daily Mail dataset because simple extractive baselines are very effective on those types of articles, which biases models towards extractive behavior, something that was apparent in the last paper.
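To make that cleaning step a bit more concrete, here is a minimal sketch of the kind of per-post filter it implies. The subreddit whitelist, the field names, and the exact length bounds below are my own assumptions for illustration (the paper's appendix lists the real criteria), not the authors' actual code.

```python
def keep_example(post: dict,
                 allowed_subreddits: set,
                 max_post_tokens: int = 512,
                 min_summary_tokens: int = 24,
                 max_summary_tokens: int = 48) -> bool:
    """Return True if a Reddit post / TL;DR pair passes the quality filters.
    Bounds are illustrative placeholders, not the paper's exact values."""
    if post["subreddit"] not in allowed_subreddits:
        return False
    if len(post["post_tokens"]) > max_post_tokens:
        return False
    return min_summary_tokens <= len(post["summary_tokens"]) <= max_summary_tokens


# toy usage with a made-up example
example = {
    "subreddit": "relationships",
    "post_tokens": ["..."] * 300,     # pretend the post is 300 tokens long
    "summary_tokens": ["..."] * 30,   # and the TL;DR is 30 tokens long
}
print(keep_example(example, allowed_subreddits={"relationships", "AskReddit"}))  # True
```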

Human labelers

The authors wish to address a discrepancy observed during the FTLMHP paper, where the model ended up optimizing for qualities that the human labelers rated highly but the researchers did not. To do this they implement two changes in the human feedback process: 1) they switch to a fully offline mode, alternating between sending large batches of comparison data to the human labelers and re-training the models, and 2) they maintain a much more hands-on relationship with the labelers.

Models

This time around the authors significantly step up the model size, moving from 774M GPT-2 to 1.3B and 6.7B versions of GPT-3. These pretrained models are used as zero-shot baselines. The authors then fine-tune them via supervised learning to predict summaries from the TL;DR dataset; the resulting supervised models are used to generate initial sample summaries for human comparisons, to initialize the policy and reward models, and as baselines for evaluation. I'm not quite sure why they moved over to only using fine-tuned versions of the pretrained model as initializers for the policy, as this was not the case in the last paper.

The reward model is initialized from the supervised baseline with the important modification of outputting a scalar value. It is trained to predict which of two summaries y_1, y_2 is better, as judged by a human, given post x (a sketch of this loss follows below). The goal is a model that takes a sequence of text and returns a scalar reward that numerically represents the human preference. Here the underlying model is a language model, but that need not be the case; what matters is that the scalar output slots directly into the RLHF pipeline.

Finally there's the policy. Remember that the authors call the model optimized against human feedback the "policy", so the policy here is still a large language model. A post is sampled from the dataset, the policy generates a summary for it, and the summary is scored by the reward model; this score is treated as a reward for the entire summary and maximized using PPO. In RL terms, an episode begins when a new post is sampled, each BPE token thereafter is a time step, and the episode terminates when the policy outputs the EOS token, at which point the reward is calculated using the reward model. As before, there is a KL divergence penalty on the difference between the policy and the supervised baseline.
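Here is a minimal sketch of the reward model's pairwise comparison loss described above: the model is trained so that the human-preferred summary receives the higher scalar score. The function and variable names are mine; assume r_preferred and r_rejected are the reward model's scalar outputs for the two summaries of each post in a batch.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise comparison loss for the reward model:
    loss = -E[ log sigmoid( r(x, y_preferred) - r(x, y_rejected) ) ],
    i.e. maximize the probability that the human-chosen summary scores higher."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# toy usage: scalar rewards for a batch of four comparison pairs
r_pref = torch.tensor([1.2, 0.3, 0.8, 2.0])
r_rej = torch.tensor([0.9, 0.5, -0.1, 1.1])
print(reward_model_loss(r_pref, r_rej))
```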

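And here is a sketch of how the reward that PPO maximizes combines the reward model's score for the whole summary with the KL penalty against the supervised baseline, i.e. R(x, y) = r(x, y) - beta * log[ pi_RL(y|x) / pi_SFT(y|x) ]. The value of beta and the per-token log-probability tensors are assumptions for illustration, not the paper's exact setup.

```python
import torch

def ppo_reward(r_rm: torch.Tensor,
               logprobs_policy: torch.Tensor,
               logprobs_sft: torch.Tensor,
               beta: float = 0.05) -> torch.Tensor:
    """Total reward for one generated summary: the reward model's scalar
    score minus a KL penalty that keeps the policy from drifting too far
    from the supervised baseline. logprobs_* hold the per-token
    log-probabilities of the sampled summary under each model."""
    kl_per_token = logprobs_policy - logprobs_sft   # log-ratio, token by token
    return r_rm - beta * kl_per_token.sum()

# toy usage: a 3-token summary scored 1.5 by the reward model
r = torch.tensor(1.5)
lp_policy = torch.tensor([-1.0, -0.8, -1.2])
lp_sft = torch.tensor([-1.1, -0.9, -1.3])
print(ppo_reward(r, lp_policy, lp_sft))
```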
Results

The authors find that policies trained with human feedback are largely preferred over much larger supervised models. For example, the 1.3B human feedback model significantly outperforms a supervised model 10x its size, and the 6.7B model in turn outperforms the 1.3B version, meaning the approach benefits from scale. Additionally, on out-of-distribution news articles the RLHF models generalize well, with performance almost as strong as equally sized supervised models.

Final thoughts

Since this paper came out, RLHF has become a staple of LLMs, especially those aimed at human interaction, so it's really interesting to see where it came from. The move to fully offline training seemed to be a great benefit to performance, and overall I found the paper pretty easy to follow. Everything made sense, and I admire the authors for setting new standards when it comes to aligning language models with human preferences.