While large language models (LLMs) are becoming increasingly effective at complicated tasks, there are many cases where they can’t get the correct answer on the first try. This is why there is growing interest in enabling LLMs to spot and correct their mistakes, also known as “self-correction.” However, current attempts at self-correction are limited and have requirements that often cannot be met in real-world situations.
In a new paper, researchers at Google DeepMind introduce Self-Correction via Reinforcement Learning (SCoRe), a novel technique that significantly improves the self-correction capabilities of LLMs using only self-generated data. SCoRe can be a valuable tool for making LLMs more robust and reliable and opens new possibilities for enhancing their reasoning and problem-solving abilities.
The importance of self-correction in LLMs
“Self-correction is a capability that greatly enhances human thinking,” Aviral Kumar, research scientist at Google DeepMind, told VentureBeat. “Humans often spend more time thinking, trying out multiple ideas, correcting their mistakes, to finally then solve a given challenging question, as opposed to simply in one-shot producing solutions for challenging questions. We would want LLMs to be able to do the same.”
Ideally, an LLM with strong self-correction capabilities should be able to review and refine its own answers until it reaches the correct response. This is especially important because LLMs often possess the knowledge needed to solve a problem internally but fail to use it effectively when generating their initial response.
“From a fundamental ML point of view, no LLM is expected to solve hard problems all within zero-shot using its memory (no human certainly can do this), and hence we want LLMs to spend more thinking computation and correct themselves to succeed on hard problems,” Kumar said.
Previous attempts at enabling self-correction in LLMs have relied on prompt engineering or fine-tuning models specifically for self-correction. These methods usually assume that the model can receive external feedback on the quality of the outputs or has access to an “oracle” that can guide the self-correction process.
These techniques fail to use the intrinsic self-correction capabilities of the model. Supervised fine-tuning (SFT) methods, which involve training a model to fix the mistakes of a base model, have also shown limitations. They often require oracle feedback from human annotators or stronger models and do not rely on the model’s own knowledge. Some SFT methods even require multiple models during inference to verify and refine the answer, which makes them difficult to deploy.
Additionally, DeepMind’s research shows that while SFT methods can improve a model’s initial responses, they do not perform well when the model needs to revise its answers over multiple steps, which is often the case with complicated problems.
“It might very well happen that by the end of training the model will know how to fix the base model’s mistakes but might not have enough capabilities to detect its own mistakes,” Kumar said.
Another challenge with SFT is that it can lead to unintended behavior, such as the model learning to produce the best answer in the first attempt and not changing it in subsequent steps, even if it’s incorrect.
“We found behavior of SFT trained models largely collapses to this ‘direct’ strategy as opposed to learning how to self-correct,” Kumar said.
Self-correction through reinforcement learning
To overcome the limitations of previous approaches, the DeepMind researchers turned to reinforcement learning (RL).
“LLMs today cannot do [self-correction], as is evident from prior studies that evaluate self-correction. This is a fundamental issue,” Kumar said. “LLMs are not trained to look back and introspect their mistakes, they are trained to produce the best response given a question. Hence, we started building methods for self-correction.”
SCoRe trains a single model to both generate responses and correct its own errors without relying on external feedback. Importantly, SCoRe achieves this by training the model entirely on self-generated data, eliminating the need for human annotators or a stronger model to supply corrections.
Previous attempts to use RL for self-correction have mostly relied on single-turn interactions, which can lead to undesirable outcomes, such as the model focusing solely on the final answer and ignoring the intermediate steps that guide self-correction.
“We do see… ‘behavior collapse’ in LLMs trained to do self-correction with naive RL. It learned to simply ignore the instruction to self-correct and produce the best response out of its memory, in zero-shot, without learning to correct itself,” Kumar said.
To prevent behavior collapse, SCoRe uses a two-stage training process with regularization techniques. The first stage replaces SFT with a process that optimizes correction performance while ensuring that the model’s initial attempts remain close to the base model’s outputs.
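To make the first stage more concrete, here is a minimal sketch of what such a regularized objective could look like. The article does not give the formula, so the specifics are assumptions: the use of a KL penalty, the weight `beta`, and the function names are all hypothetical, and a real system would compute these quantities over token-level model outputs rather than toy lists.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def stage_one_objective(second_attempt_reward, first_attempt_dist, base_dist, beta=1.0):
    """Hypothetical stage-one objective (to be maximized).

    Rewards a good second attempt while penalizing first attempts that
    drift away from the base model's output distribution.
    """
    penalty = kl_divergence(first_attempt_dist, base_dist)
    return second_attempt_reward - beta * penalty


# First attempt identical to the base model: no penalty is applied.
unchanged = stage_one_objective(1.0, [0.5, 0.5], [0.5, 0.5])

# First attempt has drifted from the base model: the objective is reduced.
drifted = stage_one_objective(1.0, [0.9, 0.1], [0.5, 0.5], beta=2.0)
```

The intent of a penalty like this is that the model can only raise the objective by getting better at the correction step, not by rewriting its first answer.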
The second stage employs multi-turn RL to optimize reward at both the initial and subsequent attempts while incorporating a reward bonus that encourages the model to improve its responses from the first to the second attempt.
“Both the initialization and the reward bonus ensure that the model cannot simply learn to produce the best first-attempt response and only minorly edit it,” the researchers write. “Overall, SCoRe is able to elicit knowledge from the base model to enable positive self-correction.”
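One simple way to picture the second-stage reward bonus is as a term added to the second attempt's reward that scales with the improvement over the first attempt. The sketch below is an illustration under assumptions, not the paper's exact formulation: it treats correctness as a 0/1 reward, and `bonus_weight` is a hypothetical parameter.

```python
def shaped_rewards(first_correct, second_correct, bonus_weight=0.5):
    """Return (first_attempt_reward, second_attempt_reward) for one episode.

    The second reward includes a bonus proportional to the change in
    correctness between attempts, so fixing a mistake is rewarded and
    breaking a correct answer is penalized.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    bonus = bonus_weight * (r2 - r1)
    return r1, r2 + bonus


# The four possible outcomes of a two-attempt episode.
fixed = shaped_rewards(False, True)      # wrong, then right
kept = shaped_rewards(True, True)        # right both times
broken = shaped_rewards(True, False)     # right, then wrong
stuck = shaped_rewards(False, False)     # wrong both times
```

Under this shaping, an episode that fixes a mistake earns more at the second attempt than one that was right all along, which is the pressure that discourages the "direct" strategy Kumar describes.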
SCoRe in action
The DeepMind researchers evaluated SCoRe against existing methods that use self-generated data for self-correction training. They focused on math and coding tasks, using benchmarks such as MATH, MBPP, and HumanEval.
The results showed that SCoRe significantly improved the self-correction capabilities of Gemini 1.0 Pro and 1.5 Flash models. For example, SCoRe achieved an absolute gain of 15.6 percentage points in self-correction on the MATH benchmark and 9.1 percentage points on the HumanEval benchmark in comparison to the base model, beating other self-correction methods by several percentage points.
The most notable improvement was in the model’s ability to correct its mistakes from the first to the second attempt. SCoRe also considerably reduced the instances where the model mistakenly changed a correct answer to an incorrect one, indicating that it learned to apply corrections only when necessary.
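The two behaviors described above, fixing wrong answers and breaking right ones, can be measured directly from per-problem correctness on each attempt. The following is a small sketch of that bookkeeping; the function and field names are hypothetical and the sample data is made up for illustration.

```python
def correction_stats(first_correct, second_correct):
    """Summarize two-attempt results given per-problem correctness flags."""
    if len(first_correct) != len(second_correct):
        raise ValueError("both attempts must cover the same problems")
    n = len(first_correct)
    if n == 0:
        raise ValueError("need at least one problem")

    pairs = list(zip(first_correct, second_correct))
    fixed = sum(1 for a, b in pairs if not a and b)    # incorrect -> correct
    broken = sum(1 for a, b in pairs if a and not b)   # correct -> incorrect

    acc_first = sum(first_correct) / n
    acc_second = sum(second_correct) / n
    return {
        "acc_first": acc_first,
        "acc_second": acc_second,
        "net_gain": acc_second - acc_first,
        "fixed_rate": fixed / n,
        "broken_rate": broken / n,
    }


# Illustrative data only: four problems, two attempts each.
stats = correction_stats(
    first_correct=[True, False, False, True],
    second_correct=[True, True, False, False],
)
```

In this toy example one answer is fixed and one is broken, so the net gain is zero even though the model changed half of its answers. A method that self-corrects well needs a high fixed rate and a low broken rate at the same time.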
Furthermore, SCoRe proved to be highly efficient when combined with inference-time scaling strategies such as self-consistency. By splitting the same inference budget across multiple rounds of correction, SCoRe enabled further performance gains.
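As a rough illustration of splitting a fixed budget, the sketch below spends part of the budget on independent first attempts and the rest on revising each one, then takes a majority vote over the final answers. The article does not spell out the exact procedure, so this is an assumption; `generate` and `revise` are hypothetical stand-ins for model calls.

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def vote_with_correction(generate, revise, question, budget, rounds=2):
    """Split `budget` model calls into chains of `rounds` attempts each.

    Each chain makes one initial attempt and (rounds - 1) revisions;
    the final answers of all chains are then majority-voted.
    """
    chains = budget // rounds
    if chains < 1:
        raise ValueError("budget must cover at least one full chain")
    finals = []
    for i in range(chains):
        answer = generate(question, i)
        for _ in range(rounds - 1):
            answer = revise(question, answer)
        finals.append(answer)
    return majority_vote(finals)


# Toy stand-ins for the model: half of the first attempts are wrong ("5"),
# and the revision step repairs them.
def toy_generate(question, i):
    return ["4", "5", "4", "5"][i % 4]


def toy_revise(question, answer):
    return "4" if answer == "5" else answer


result = vote_with_correction(toy_generate, toy_revise, "2 + 2 = ?", budget=8)
```

With `rounds=1` the same function reduces to plain self-consistency over `budget` parallel samples, which makes it easy to compare the two ways of spending the same number of calls.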
While the paper primarily focuses on coding and reasoning tasks, the researchers believe that SCoRe can be beneficial for other applications as well.
“You could imagine teaching models to look back at their outputs that might potentially be unsafe and improve them all by themselves, before showing it to the user,” Kumar said.
The researchers believe that their work has broader implications for training LLMs and highlights the importance of teaching models how to reason and correct themselves rather than simply mapping inputs to outputs.