Constitutional AI: Harmlessness from AI Feedback

Original paper · Bai et al., 2022

RLHF has, despite much skepticism, proven pivotal in accelerating state-of-the-art dialogue-based language models. The question on everyone's mind lately has been how this scales in the long term. It's generally agreed that we've exhausted almost all of the high-quality tokens available on the internet, and the rising frontier is synthetic data. If we're going to scale models far beyond where they are today, we're also going to need to scale alignment, which is already an expensive process. Human feedback can't scale to magnitudes beyond today's horizons, and as a result researchers have looked for ways of cutting out the need for human labels. When it comes to synthetic data, Anthropic stands out as the most prominent player. They've been persistent in working without human labels for quite some time, and today I'd like to take a deeper dive into a method they coin Constitutional AI: a method for training AI systems to be helpful, honest, and harmless without human supervision, governed entirely through the specification of a short list of principles or instructions, i.e. a constitution. The motivations behind their work were:

  1. Study the possibility of using AI systems to help supervise other AIs, thus scaling supervision.
  2. Improve upon prior work in training harmless AI assistants by eliminating evasive responses, reducing tension between helpfulness and harmlessness.
  3. Make the principles governing AI behavior more transparent.

The Constitutional AI (CAI) approach

CAI is an extreme form of scaled supervision (techniques that leverage AI to help humans efficiently supervise AI) that relies on a set of guiding principles as the only human input. The training process is two-fold: the first, supervised stage gets the model "on-distribution", and the second, RL stage refines and improves performance. Using SL as the first step in bootstrapping the RL process is standard practice; it counteracts the brittleness and difficulty of open-ended RL.
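To make the "constitution" concrete, here is a minimal sketch of how it could be represented in code: nothing more than a short list of natural-language principles. The wording below is paraphrased for illustration rather than quoted from the paper.

```python
# A toy "constitution": a short list of natural-language principles.
# Wording is paraphrased/illustrative, not the exact text from the paper.
CONSTITUTION = [
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Choose the response that is least likely to be viewed as harmful or offensive "
    "to a non-western audience.",
    "Choose the response that sounds most similar to what a peaceful, ethical, "
    "and wise person would say.",
]
```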

Supervised Stage. The first stage of CAI, sometimes called "principled instruction correction", consists of prompting the model with harmful prompts and collecting its responses. The model is then asked to critique each response based on a randomly sampled principle from the constitution and to revise the original response accordingly. This builds a supervised dataset which is used to finetune a pretrained language model. The main purpose of this phase is to easily and flexibly alter the distribution of the model's responses, reducing the need for exploration and the total length of training during the second RL phase.
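A minimal sketch of the critique-revision loop is below, assuming some `generate` callable that wraps a helpful-only model (prompt in, completion out); the prompt templates are simplified paraphrases of the style used in the paper, not its exact wording.

```python
import random
from typing import Callable

# Illustrative critique/revision instructions; the paper draws these from a
# larger set of constitutional principles.
CRITIQUE_REQUESTS = [
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal.",
]
REVISION_REQUESTS = [
    "Please rewrite the assistant's response to remove any and all harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content.",
]

def critique_and_revise(prompt: str,
                        generate: Callable[[str], str],
                        n_revisions: int = 1) -> str:
    """One round of the supervised-stage data pipeline for a single harmful prompt.

    `generate` is a stand-in for any language model call; it is an assumption
    of this sketch, not an API from the paper.
    """
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for _ in range(n_revisions):
        critique_req = random.choice(CRITIQUE_REQUESTS)
        revision_req = random.choice(REVISION_REQUESTS)
        critique = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique Request: {critique_req}\n\nCritique:"
        )
        response = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique Request: {critique_req}\n\nCritique: {critique}\n\n"
            f"Revision Request: {revision_req}\n\nRevision:"
        )
    # The (prompt, final revision) pairs become the supervised finetuning data.
    return response
```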

RL Stage. The second stage mimics RLHF, except that human preference labels are replaced with AI feedback (i.e. RLAIF). The model trained through SL is asked to generate a number of responses to each harmful prompt in a dataset, and is then asked to rank those responses according to a constitutional principle. This produces an AI-generated preference dataset which is used to train a preference model (PM). In the same vein as RLHF, the SL model is finetuned against the PM, resulting in a policy trained by RLAIF.
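Below is a minimal sketch of how the AI-feedback preference pairs might be assembled, again assuming a generic `generate` callable wrapping the SL-stage model. Note one simplification: the paper reads off the model's log-probabilities over the answer options to get soft preference labels, whereas this sketch just parses a sampled answer.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Illustrative feedback principle, paraphrased from the style used in the paper.
FEEDBACK_PRINCIPLE = (
    "Which of these assistant responses is less harmful? Choose the response "
    "that a wise, ethical, polite, and friendly person would more likely say."
)

def build_preference_pairs(prompt: str,
                           generate: Callable[[str], str],
                           n_samples: int = 4) -> List[Tuple[str, str]]:
    """Produce (chosen, rejected) preference pairs for one harmful prompt.

    `generate` is a stand-in for any language model call; prompt template and
    answer parsing are simplified assumptions of this sketch.
    """
    responses = [generate(f"Human: {prompt}\n\nAssistant:") for _ in range(n_samples)]
    pairs = []
    for a, b in combinations(responses, 2):
        judgement = generate(
            f"Consider the following conversation:\n\nHuman: {prompt}\n\n"
            f"{FEEDBACK_PRINCIPLE}\n\nOptions:\n(A) {a}\n(B) {b}\n\n"
            "The answer is:"
        )
        chosen, rejected = (a, b) if judgement.strip().startswith("(A)") else (b, a)
        pairs.append((chosen, rejected))
    # These pairs train the preference model that the RLAIF policy is optimized against.
    return pairs
```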

Collective Constitutional AI

In a pioneering experiment, Anthropic, in collaboration with the Collective Intelligence Project, engaged around 1,000 Americans to help draft a constitution for an AI system. This initiative aimed to explore how democratic processes can influence AI development, particularly through Anthropic's Constitutional AI (CAI) method. Traditionally, Anthropic's AI models, like Claude, have been guided by an in-house constitution inspired by global ethical standards. This experiment was a departure, allowing public involvement in shaping AI values.

The public's input led to a constitution that both aligned with and diverged from Anthropic's original version. The experiment was groundbreaking in its approach, as it was one of the first times a language model's behavior was directly shaped by collective public deliberation. This effort represents a significant step towards making AI systems more transparent, representative, and accountable, illustrating the potential for democratic processes to shape the future of AI development.