Fine-Tuning Language Models from Human Preferences

Original paper · Ziegler et al., 2019

Given the extensive advancements in natural language processing, the authors would like to apply reinforcement learning to complex tasks where good and bad results can only be judged by asking humans. Reward learning enables exactly this: a reward model is built by asking humans questions, and that model defines the reward for tasks where success is a matter of human judgement. This paper combines recent advancements in pretrained language models with human preference learning by fine-tuning with RL rather than supervised learning, using a reward model trained from human preferences. This is interesting because it can be useful where downstream tasks lack sizeable supervised datasets, or where programmatic reward functions are poor proxies for our true goal.

Method

Let's walk through the steps the authors take to optimize a large language model from human preferences. An important note before we begin: the authors evaluate this method on two specific downstream tasks, continuation of text in a way that matches a target style, and text summarization (on the CNN/Daily Mail and TL;DR datasets).

  1. They begin with a 774M parameter version, p, of GPT-2, pretrained on the WebText dataset with a vocabulary of roughly 50,000 tokens. As in much previous work, byte pair encoding (BPE) is used to tokenize the text. For stylistic continuation, the model is first fine-tuned in a supervised manner on BookCorpus before RL fine-tuning on the final task.

  2. A policy π = p is initialized, which will be fine-tuned with RL to perform the final task well. But before we get there we need something to optimize, and here the authors use human labels to train a reward model that the policy can then be optimized against.

  3. The reward model, r, starts from the language model p, with a randomly initialized linear head on top of p's final embedding output, and is then trained on the collected human comparisons (a sketch of the preference loss follows this list). Training runs for a single epoch to avoid overfitting on the limited amount of human data.

  4. Finally, the authors fine-tune π via Proximal Policy Optimization (PPO), using the reward model plus a KL penalty that keeps the policy π from drifting too far from p and out of the range where r is valid (a sketch of this penalty also follows the list). PPO runs for 2M episodes (x, y pairs), with four PPO epochs per batch and one minibatch each.
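
To make step 3 concrete, here is a minimal sketch of the preference loss, assuming a hypothetical `reward_model(x, y)` interface that returns a scalar score for a prompt and continuation (in the paper the reward model shares its transformer body with p; that detail is omitted here). Labelers pick the best of four sampled continuations, and r is trained to put high softmax probability on that choice.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, x, candidates, best_idx):
    """Cross-entropy over which sampled continuation the labeler preferred.

    reward_model(x, y) is an assumed interface returning a scalar score,
    candidates holds the four sampled continuations for prompt x, and
    best_idx is the index of the one the human picked.
    """
    # Score every candidate continuation with the current reward model.
    scores = torch.stack([reward_model(x, y) for y in candidates])        # shape (4,)
    # Maximize the log-softmax probability assigned to the human's choice.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([best_idx]))
```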

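And a sketch of the penalty in step 4, assuming we already have the total log-probability of a sampled continuation under both the policy π and the original model p: the reward fed to PPO is the reward model's score minus a scaled log-ratio, and the coefficient β is adapted so the measured KL between π and p stays near a target. The constants are illustrative, based on the paper's adaptive-KL description.

```python
def penalized_reward(r_score, logp_pi, logp_p, beta):
    """Reward handed to PPO for one continuation: the reward model's score minus
    a KL-style penalty that keeps the policy close to the original language model."""
    return r_score - beta * (logp_pi - logp_p)

def adapt_beta(beta, observed_kl, target_kl, k_beta=0.1):
    """Proportional controller that nudges beta so KL(pi, p) tracks target_kl;
    the clip keeps any single update gentle."""
    error = max(-0.2, min(0.2, (observed_kl - target_kl) / target_kl))
    return beta * (1 + k_beta * error)
```
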
Results

The RL fine-tuning approach proved successful on the continuation task compared to the zero-shot language model baseline when evaluated by humans. It performed well above the baselines with only 2.5k samples and saw little benefit from additional samples. On the summarization tasks, the policies turned into "smart copiers": facts were extracted, fairly intelligently, from the original source rather than generated abstractively. This led to models that were largely truthful, but that do not seem to genuinely understand the source content.

Final thoughts

As the authors note, I think it's important to take a step beyond the traditional benchmarks when validating natural language models. Great benefit can be derived from applying human reward learning to these models, but doing so also requires high-quality human data.