Reinforcement Learning from Human Feedback (RLHF) has turned out to be the key to unlocking the full potential of today’s large language models (LLMs). There is arguably no better evidence for this than OpenAI’s GPT-3 model. It was released back in 2020, but it was only its RLHF-trained successor, ChatGPT, that became an overnight sensation, capturing the attention of millions and setting a new standard for conversational AI.
Before RLHF, the LLM training process typically consisted of a pre-training stage in which the model learned the general structure of the language and a fine-tuning stage in which it learned to perform a specific task. By integrating human judgment as a third training stage, RLHF ensures that models not only produce coherent and useful outputs but also align more closely with human values, preferences, and expectations. It achieves this through a feedback loop where human evaluators rate or rank the model’s outputs, which is then used to adjust the model’s behavior.
This article explores the intricacies of RLHF. We will look at its importance for language modeling, analyze its inner workings in detail, and discuss the best practices for implementation.
Importance of RLHF in LLMs
When analyzing the importance of RLHF to language modeling, one could approach it from two different perspectives.
On the one hand, this technique emerged as a response to the limitations of traditional supervised fine-tuning, such as its reliance on static datasets that are often limited in scope, context, and diversity and rarely capture broader human values, ethics, or social norms. Additionally, traditional fine-tuning often struggles with tasks that involve subjective judgment or ambiguity, where there may be multiple valid answers. In such cases, a model might favor one answer over another based on the training data, even if the alternative might be more appropriate in a given context. RLHF provides a way to lift some of these limitations.
On the other hand, however, RLHF represents a paradigm shift in the fine-tuning of LLMs. It forms a standalone, transformative change in the evolution of AI rather than a mere incremental improvement over existing methods.
Let’s look at it from the latter perspective first.
The paradigm shift brought about by RLHF lies in the integration of human feedback directly into the training loop, enabling models to better align with human values and preferences. This approach prioritizes dynamic model-human interactions over static training datasets. By incorporating human insights throughout the training process, RLHF ensures that models are more context-aware and capable of handling the complexities of natural language.
I now hear you asking: “But how is injecting the human into the loop better than traditional fine-tuning, in which we train the model in a supervised fashion on a static dataset? Can’t we simply pass human preferences to the model by constructing a fine-tuning dataset based on these preferences?” That’s a fair question.
Consider succinctness as a preference for a text-summarizing model. We could fine-tune an LLM to produce concise summaries by training it in a supervised manner on a set of input-output pairs, where the input is the original text and the output is the desired summary.
The problem here is that different summaries can be equally good, and different groups of people will have preferences as to what level of succinctness is optimal in different contexts. When relying solely on traditional supervised fine-tuning, the model might learn to generate concise summaries, but it won’t necessarily grasp the subtle balance between brevity and informativeness that different users might prefer. This is where RLHF offers a distinct advantage.
In RLHF, we instead train the model on a preference dataset.
Each example consists of the long input text, two alternative summaries, and a label that signals which of the two was preferred by a human annotator. By directly exposing the model to human preferences via the label indicating the “better” output, we can ensure it aligns with them properly.
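To make this concrete, here is a minimal, purely illustrative sketch of what a couple of records in such a preference dataset could look like in Python. The field names and texts are hypothetical, not a standard schema:

```python
# Illustrative preference records for a summarization task.
# Field names and contents are made up for demonstration purposes.
preference_dataset = [
    {
        "input_text": "The quarterly report shows that revenue grew by 12%, "
                      "driven mostly by the cloud division, while costs ...",
        "summary_1": "Revenue grew 12% last quarter, driven by cloud sales.",
        "summary_2": "The report discusses revenue. It grew. Cloud was a factor, "
                     "and there were also costs involved ...",
        "preference": 1,  # the human annotator preferred summary_1
    },
    {
        "input_text": "Researchers announced a new sodium-ion battery chemistry "
                      "that charges in about ten minutes and ...",
        "summary_1": "A new battery was announced.",
        "summary_2": "Researchers unveiled a sodium-ion battery that charges "
                     "in roughly ten minutes.",
        "preference": 2,  # the human annotator preferred summary_2
    },
]
```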
Let’s discuss how this works in detail.
The RLHF process
The RLHF process consists of three steps:
- Collecting human feedback.
- Training a reward model.
- Fine-tuning the LLM using the reward model.
The algorithm enabling the last step in the process is Proximal Policy Optimization (PPO).
Collecting human feedback
The first step in RLHF is to collect human feedback in the form of a so-called preference dataset. In its simplest form, each example in this dataset consists of a prompt, two different answers produced by the LLM as responses to this prompt, and an indicator for which of the two answers was deemed better by a human evaluator.
The specific dataset formats differ and are not too important. The schematic summarization dataset sketched above uses four fields: Input text, Summary 1, Summary 2, and Preference. Anthropic’s hh-rlhf dataset uses a different format: two columns with the chosen and rejected version of a dialogue between a human and an AI assistant, where the prompt is the same in both cases.
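As an illustration, the hh-rlhf dataset can be pulled with the Hugging Face datasets library (assuming it is installed and you have internet access; the snippet only inspects the first record):

```python
from datasets import load_dataset

# Anthropic's hh-rlhf preference dataset: each record contains a "chosen"
# and a "rejected" dialogue that start from the same human prompt.
hh = load_dataset("Anthropic/hh-rlhf", split="train")

example = hh[0]
print(example["chosen"][:300])    # the preferred dialogue (truncated)
print(example["rejected"][:300])  # the non-preferred dialogue (truncated)
```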
Regardless of the format, the information contained in the human preference dataset is always the same: it’s not that one answer is good and the other is bad, but that one, albeit imperfect, is preferred over the other. It’s all about preference.
Now you might wonder why the labelers are asked to pick one of two responses instead of, say, scoring each response on a scale. The problem with scores is that they are subjective. Scores provided by different individuals, or even two scores from the same labeler but on different examples, are not comparable.
So how do the labelers decide which of the two responses to pick? This is arguably the most important nuance in RLHF. The labelers are offered specific instructions outlining the evaluation protocol. For example, they might be instructed to pick the answers that don’t use swear words, the ones that sound more friendly, or the ones that don’t offer any dangerous information. What the instructions tell the labelers to focus on is crucial to the RLHF-trained model, as it will only align with those human values that are contained within these instructions.
More advanced approaches to building a preference dataset might involve humans ranking more than two responses to the same prompt. Consider three different responses: A, B, and C.
Human annotators have ranked them as follows, where “1” is best, and “3” is worst:
- A – 2
- B – 1
- C – 3
Out of these, we can create three pairs resulting in three training examples:
Preferred response | Non-preferred response
B | A
B | C
A | C
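A small sketch of how such a ranking can be expanded into pairwise preference examples, assuming a lower rank number means a better response (the dictionary keys below are illustrative):

```python
from itertools import combinations

# Rankings from the example above: lower number = better.
ranking = {"A": 2, "B": 1, "C": 3}

pairs = []
for x, y in combinations(ranking, 2):
    preferred, rejected = (x, y) if ranking[x] < ranking[y] else (y, x)
    pairs.append({"preferred": preferred, "non_preferred": rejected})

print(pairs)
# [{'preferred': 'B', 'non_preferred': 'A'},
#  {'preferred': 'A', 'non_preferred': 'C'},
#  {'preferred': 'B', 'non_preferred': 'C'}]
```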
Training a reward model
Once we have our preference dataset in place, we can use it to train a reward model (RM).
The reward model is typically also a language model, often a smaller encoder-only one such as BERT, equipped with a scalar output head. During training, the RM receives three inputs from the preference dataset: the prompt, the winning response, and the losing response. It produces one scalar output, called a reward, for each of the two responses.
The training objective is to maximize the reward difference between the winning and the losing response. An often-used loss function is the pairwise ranking loss, which amounts to a binary cross-entropy loss on the difference between the two rewards.
This way, the reward model learns to distinguish between more and less preferred responses, effectively ranking them based on the preferences encoded in the dataset. As the model continues to train, it becomes better at predicting which responses will likely be preferred by a human evaluator.
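As a minimal sketch, this pairwise objective can be written in PyTorch as the negative log-sigmoid of the reward difference (the function name and toy tensors below are illustrative, not a specific library’s API):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_winning: torch.Tensor,
                         reward_losing: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style pairwise ranking loss: equivalent to a binary
    # cross-entropy on the reward difference. It decreases as the reward
    # for the winning response grows relative to the losing one.
    return -F.logsigmoid(reward_winning - reward_losing).mean()

# Toy rewards for a batch of two preference pairs. In practice, these come
# from the reward model's scalar head, one value per prompt-response pair.
r_win = torch.tensor([1.3, 0.2])
r_lose = torch.tensor([0.4, 0.9])
print(pairwise_reward_loss(r_win, r_lose).item())
```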
Once trained, the reward model serves as a simple regressor, predicting the reward value for a given prompt-completion pair.
Fine-tuning the LLM with the reward model
The third and final RLHF stage is fine-tuning. This is where the reinforcement learning takes place.
The fine-tuning stage requires another dataset that is different from the preference dataset. It consists of prompts only, which should be similar to what we expect our LLM to deal with in production. Fine-tuning teaches the LLM to produce aligned responses for these prompts.
Specifically, the goal of fine-tuning is to train the LLM to produce completions that maximize the rewards given by the reward model. The training loop looks as follows:
- First, we pass a prompt from the training set to the LLM and generate a completion.
- Next, the prompt and the completion are passed to the reward model, which in turn predicts the reward.
- Finally, this reward is fed into an optimization algorithm such as PPO (more about it in the next section), which then adjusts the LLM’s weights in a direction resulting in a better RM-predicted reward for the given training example (not unlike gradient descent in traditional deep learning).
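Schematically, and with all the heavy lifting hidden behind hypothetical placeholder functions (the generation call, the reward model call, and the PPO update below are not a real library API), the loop could be sketched like this:

```python
# A highly simplified sketch of the RLHF fine-tuning loop. llm, reward_model,
# and ppo_update are hypothetical placeholders standing in for the policy
# model, the trained reward model, and the PPO optimizer step.

def rlhf_finetuning_loop(llm, reward_model, ppo_update, prompts, num_epochs=1):
    for _ in range(num_epochs):
        for prompt in prompts:
            # 1. The current policy (the LLM being trained) generates a completion.
            completion = llm.generate(prompt)
            # 2. The reward model scores the prompt-completion pair.
            reward = reward_model(prompt, completion)
            # 3. PPO adjusts the LLM's weights toward higher-reward completions.
            ppo_update(llm, prompt, completion, reward)
```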
Proximal Policy Optimization (PPO)
One of the most popular optimizers for RLHF is the Proximal Policy Optimization algorithm or PPO. Let’s unpack this mouthful.
In the reinforcement learning context, the term “policy” refers to the strategy an agent uses to decide its actions. In the RLHF world, the policy is the LLM we are training, which decides which tokens to generate in its responses. Hence, “policy optimization” means we are optimizing the LLM’s weights.
What about “proximal”? The term “proximal” refers to the key idea in PPO of making only small, controlled changes to the policy during training. This prevents an issue all too common in traditional policy gradient methods, where large updates to the policy can sometimes lead to significant performance drops.
PPO under the hood
The PPO loss function is composed of three components:
- Policy Loss: PPO’s primary objective when improving the LLM.
- Value Loss: Used to train the value function, which estimates the future rewards from a given state. The value function allows for computing the advantage, which in turn is used to update the policy.
- Entropy Loss: Encourages exploration by penalizing certainty in the action distribution, allowing the LLM to remain creative.
The total PPO loss can be expressed as:
L_PPO = L_POLICY + a × L_VALUE + b × L_ENTROPY
where a and b are weight hyperparameters.
The entropy loss component is derived from the entropy of the probability distribution over the next tokens during generation. We don’t want this entropy to become too small, as that would discourage diversity in the generated texts.
The value loss component is computed step-by-step as the LLM generates subsequent tokens. At each step, the value loss is the (typically squared) difference between the actual future total reward (based on the full completion) and its current-step estimate produced by the so-called value function. Reducing the value loss trains the value function to be more accurate, resulting in better future reward predictions.
In the policy loss component, we use the value function to predict the future rewards of different possible completions (next tokens). Based on these predictions, we can estimate the so-called advantage term, which captures how much better or worse a particular completion is compared to the alternatives.
If the advantage term for a given completion is positive, it means that increasing the probability of this particular completion being generated will lead to a higher reward and, thus, a better-aligned model. Hence, we should tweak the LLM’s parameters such that this probability is increased.
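Putting the three components together, here is a minimal, self-contained PyTorch sketch of the combined PPO loss. The clipping threshold and the a and b weights are illustrative hyperparameters, and the tensors in the toy call are random placeholders that only demonstrate the expected shapes:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def ppo_loss(logits_new,    # (batch, seq, vocab): current policy's logits
             logprobs_new,  # (batch, seq): log-probs of generated tokens under the current policy
             logprobs_old,  # (batch, seq): log-probs of the same tokens before the update
             advantages,    # (batch, seq): advantage estimates
             values,        # (batch, seq): value function predictions
             returns,       # (batch, seq): observed future rewards
             clip_eps=0.2, a=0.5, b=0.01):
    # Policy loss: clipped surrogate objective. Clipping the probability
    # ratio keeps every update small -- the "proximal" part of PPO.
    ratio = torch.exp(logprobs_new - logprobs_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Value loss: squared error between predicted values and observed returns.
    value_loss = F.mse_loss(values, returns)

    # Entropy loss: negative entropy of the next-token distribution, so that
    # minimizing the total loss keeps the policy from becoming too certain.
    entropy_loss = -Categorical(logits=logits_new).entropy().mean()

    # L_PPO = L_POLICY + a * L_VALUE + b * L_ENTROPY, as in the formula above.
    return policy_loss + a * value_loss + b * entropy_loss

# Toy call: batch of 2 sequences, 4 generated tokens, vocabulary of 10.
B, T, V = 2, 4, 10
loss = ppo_loss(torch.randn(B, T, V), torch.randn(B, T), torch.randn(B, T),
                torch.randn(B, T), torch.randn(B, T), torch.randn(B, T))
print(loss.item())
```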
PPO alternatives
PPO is not the only optimizer used for RLHF. With the current pace of AI research, new alternatives spring up like mushrooms. Let’s take a look at a few worth mentioning.
Direct Preference Optimization (DPO) is based on the observation that the preference-based objective used to train the reward model in RLHF can be reformulated to fine-tune the LLM directly, without training a separate reward model or running a reinforcement learning loop. DPO is simpler and more efficient than PPO-based RLHF, and in its authors’ evaluations it matched or exceeded it in response quality.
Another interesting alternative to PPO is Contrastive Preference Learning (CPL). Its proponents argue that the assumption behind reward-based RLHF, namely that human preferences are distributed according to reward, is flawed; recent work suggests that preferences instead follow the regret of a choice. Similarly to DPO, CPL circumvents the need to train a reward model, replacing it with a regret-based model of human preferences trained with a contrastive loss.
Best practices for RLHF
Let’s go back to the vanilla PPO-based RLHF. Having gone through the RLHF training process on a conceptual level, we’ll now discuss a couple of best practices to follow when implementing RLHF and the tools that might come in handy.
Avoiding reward hacking
Reward hacking is a prevalent issue in reinforcement learning. It refers to a situation where the agent learns to game the system: it maximizes the reward by taking actions that don’t align with the original objective.
In the context of RLHF, reward hacking means that the training has converged to a region of the loss surface where the generated responses receive high rewards from the reward model but don’t make much sense to the user.
Luckily, there is a simple trick that helps prevent reward hacking. During fine-tuning, we take advantage of an initial, frozen copy of the LLM (as it was before RLHF training) and pass it the same prompt that we pass to the LLM instance we are training.
Then, we compute the Kullback-Leibler (KL) divergence between the next-token distributions produced by the original model and the model under training. The KL divergence measures how dissimilar the two distributions are. We want them to remain rather similar, to make sure that the updated model does not diverge too far from its starting version. Thus, we treat the KL divergence value as a “reward penalty” and subtract it (scaled by a coefficient) from the reward before passing it to the PPO optimizer.
Incorporating this anti-reward-hacking trick into our fine-tuning pipeline changes the training loop as follows:
To prevent reward hacking, we pass the prompt to two instances of the LLM: the one being trained and its frozen version from before the training. Then, we compute the reward penalty as the KL divergence between the two models’ outputs and subtract it from the reward. This prevents the trained model from diverging too much from its initial version.
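A minimal PyTorch sketch of this penalty, assuming we already have both models’ next-token logits for the generated sequence and the reward model’s scalar score (the beta coefficient and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def kl_penalized_reward(reward, logits_trained, logits_frozen, beta=0.1):
    # Per-token KL divergence KL(trained || frozen) over the vocabulary.
    logp_trained = F.log_softmax(logits_trained, dim=-1)
    logp_frozen = F.log_softmax(logits_frozen, dim=-1)
    kl_per_token = (logp_trained.exp() * (logp_trained - logp_frozen)).sum(dim=-1)

    # Sum over the generated tokens to get one penalty value per sequence,
    # then subtract the scaled penalty from the reward model's score.
    kl = kl_per_token.sum(dim=-1)
    return reward - beta * kl

# Toy call: batch of 2 completions, 5 generated tokens, vocabulary of 10.
reward = torch.tensor([1.2, 0.7])
print(kl_penalized_reward(reward, torch.randn(2, 5, 10), torch.randn(2, 5, 10)))
```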
Scaling human feedback
As you might have noticed, the RLHF process has one bottleneck: the collection of human feedback in the form of the preference dataset is a slow manual process that needs to be repeated whenever alignment criteria (labelers’ instructions) change. Can we completely remove humans from the process?
We can certainly reduce their engagement, thus making the process more efficient. One approach to doing this is model self-supervision, or “Constitutional AI.”
The central point is the Constitution, which consists of a set of rules that should govern the model’s behavior (think: “do not swear,” “be friendly,” etc.). A human red team then prompts the LLM to generate harmful or misaligned responses. Whenever they succeed, they ask the model to critique its own responses according to the constitution and revise them. Finally, the model is trained using the red team’s prompts and the model’s revised responses.
Reinforcement Learning from AI Feedback (RLAIF) is yet another way to eliminate the need for human feedback. In this approach, one simply uses an off-the-shelf LLM to provide preferences for the preference dataset.
Let’s briefly examine some available tools and frameworks that facilitate RLHF implementation.
Data collection
Don’t have your preference dataset yet? Two great platforms that facilitate its collection are Prolific and Mechanical Turk.
Prolific is a platform for collecting human feedback at scale that is useful for gathering preference data through surveys and experiments. Amazon’s Mechanical Turk (MTurk) service allows for outsourcing data labeling tasks to a large pool of human workers, commonly used for obtaining labels for machine-learning models.
Prolific is known for having a more curated and diverse participant pool. The platform emphasizes quality and typically recruits reliable participants with a history of providing high-quality data. MTurk, on the other hand, has a more extensive and varied participant pool, but it can be less curated. This means there may be a broader range of participant quality.
End-to-end RLHF frameworks
If you are a Google Cloud Platform (GCP) user, you can very easily take advantage of their Vertex AI RLHF pipeline. It abstracts away the whole training logic; all you need to do is supply the preference dataset (to train the reward model) and the prompt dataset (for the RL-based fine-tuning) and execute the pipeline.
The disadvantage is that since the training logic is abstracted away, it’s not straightforward to make custom changes. However, this is a great place to start if you are just beginning your RLHF adventure or don’t have the time or resources to build custom implementations.
Alternatively, check out DeepSpeed Chat, Microsoft’s open-source system for training and deploying chat models using RLHF, providing tools for data collection, model training, and deployment.
Conclusion
We have discussed how important the paradigm shift brought about by RLHF is to training language models, making them aligned with human preferences. We analyzed the three steps of the RLHF training pipeline: collecting human feedback, training the reward model, and fine-tuning the LLM. Next, we took a more detailed look at Proximal Policy Optimization, the algorithm typically powering RLHF fine-tuning, and mentioned some alternatives. Finally, we discussed how to avoid reward hacking using KL divergence and how to reduce human engagement in the process with approaches such as Constitutional AI and RLAIF. We also reviewed a couple of tools facilitating RLHF implementation.
You are now well-equipped to fine-tune your own large language models with RLHF! If you do, let me know what you built!