With the rapid advancements in large language models (LLMs), transformer-based architectures are increasingly employed as tactic generators and premise selectors in automated theorem-proving systems, generating candidate proof steps or selecting useful premises based on the unfinished proof goal. According to Fields Medalist Terence Tao, the new generation of AI technology will soon become useful as a “co-pilot” for research mathematicians.
However, training LLMs to serve as proof-step generators faces a significant limitation: existing mathematical datasets include only correct proof paths. In academic publications, such as textbooks and research papers, mathematicians rarely include failed approaches in their presentations of proofs. Yet it is almost always these failed attempts that guide them toward discovering valid proofs, and omitting them often leaves readers wondering, “How did they get there?”
In our paper, Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving, we explored this problem experimentally. Our goal was to assess the influence of trial-and-error information in the training data on the performance of LLMs in theorem proving.
How do mathematicians develop proofs?
In mathematical research, the number of incorrect attempts vastly outnumbers successful ones. Mathematical reasoning is inherently iterative and nonlinear, involving numerous failed approaches and refinements. The backtracking process, wherein one recognizes a failed path and revisits earlier stages to explore alternative directions, is vital to a mathematician’s chain of thought. Thus, unsuccessful paths not only pave the way to correct proofs but are also valuable as illustrations of structured proof-search techniques.
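To make this backtracking loop concrete, here is a minimal Python sketch of a depth-first proof search. The propose_tactics and apply_tactic functions are hypothetical stand-ins for a proof-step generator and a proof checker (they are not part of our released code); when every candidate at a goal fails, the search returns to an earlier goal and explores a different direction.

```python
# Minimal sketch of depth-first proof search with backtracking.
# `propose_tactics` and `apply_tactic` are hypothetical stand-ins for a
# proof-step generator and a proof checker, used only for illustration.

def propose_tactics(goal: str) -> list[str]:
    """Return candidate proof steps for a goal (e.g., sampled from an LLM)."""
    raise NotImplementedError

def apply_tactic(goal: str, tactic: str) -> list[str] | None:
    """Return the resulting subgoals, or None if the tactic fails."""
    raise NotImplementedError

def prove(goal: str, depth: int = 0, max_depth: int = 20) -> list[str] | None:
    """Return a list of tactics closing `goal`, or None if the search fails."""
    if depth > max_depth:
        return None
    for tactic in propose_tactics(goal):
        subgoals = apply_tactic(goal, tactic)
        if subgoals is None:            # failed attempt: try the next candidate
            continue
        plan = [tactic]
        for sub in subgoals:            # prove every subgoal recursively
            sub_plan = prove(sub, depth + 1, max_depth)
            if sub_plan is None:        # dead end: backtrack to try another tactic
                break
            plan.extend(sub_plan)
        else:
            return plan                 # all subgoals closed
    return None                         # every candidate failed; the caller backtracks
```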
The primary motivation for using large language models (LLMs) in automated theorem provers (ATPs) is their capability to emulate human reasoning. Our ultimate goal is to capture the comprehensive and systematic methods human mathematicians use in theorem proving and potentially develop novel, superior strategies.
However, at the time we published our paper, approaches to training LLMs for ATPs utilized only data on correct proof attempts. Given that a model trained solely on successful proof steps learns none of the iterative trial-and-error processes mathematicians use, it is unsurprising that despite pre-training on extensive mathematical texts, the available state-of-the-art models exhibited only modest performance on challenging theorem-proving tasks.
Potential benefits of training with trial-and-error information
Now, assume that in addition to a vast collection of polished proofs, we train a model on all the trial-and-error information that could be found in mathematicians’ draft papers or in their minds. What would we expect this model to be capable of?
Generating better proof-step candidates
First, we expect the model to have a strong ability to propose high-quality guesses for single proof-step generation. Having seen large amounts of high-quality trial-and-error information during training, the model learns how to make a highly reasonable (although possibly unsuccessful) first attempt when presented with a problem.
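As an illustration, the snippet below samples several candidate proof steps for a goal from a fine-tuned causal language model using the Hugging Face transformers library. The model name and the prompt format are placeholders for this sketch, not the exact setup from our paper.

```python
# Sketch: sampling several candidate proof steps for the current goal from a
# fine-tuned causal LM. The model name and prompt format are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/proofstep-generator"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "GOAL: (p -> q) -> (q -> r) -> p -> r\nPROOFSTEP:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sample diverse candidates instead of greedy decoding
        temperature=0.8,
        num_return_sequences=8,  # several guesses for the proof search to try
        max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )

candidates = [
    tokenizer.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
    for seq in outputs
]
```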
Judging proof step candidates in reinforcement learning
Second, we expect models trained with trial-and-error information to be capable of dynamically evaluating each proof step’s potential. By “dynamic,” we mean that the confidence score the model assigns internally to the current proof strategy changes as the strategy unfolds. After generating each proof step, the model must decide whether to continue predicting the subsequent step along the current path or to initiate a backtracking operation. A higher probability of backtracking indicates a lower confidence in the current proof strategy.
A model equipped with sufficient trial-and-error data should become proficient in assessing the viability of proof strategies. The model could then serve as a reward function in reinforcement learning processes (e.g., OpenAI’s work on process supervision), where obtaining high-quality reward functions for intermediate steps is a major challenge.
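A minimal sketch of this idea, assuming the model was fine-tuned with a special backtrack marker in its vocabulary: after each proof step, we read off the probability the model assigns to initiating a backtrack and treat one minus that probability as the step’s confidence, which could also serve as a step-level reward signal.

```python
# Sketch: reading off the model's probability of initiating a backtrack after a
# proof step. The backtrack marker is an assumed special token from fine-tuning,
# not a token guaranteed to exist in any particular released checkpoint.
import torch

BACKTRACK_MARKER = "<BACKTRACK>"  # assumed special token

def step_confidence(model, tokenizer, proof_history: str) -> float:
    """Return 1 - P(backtrack | proof so far), usable as a step-level score."""
    backtrack_id = tokenizer.convert_tokens_to_ids(BACKTRACK_MARKER)
    inputs = tokenizer(proof_history, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]   # logits for the next token
    p_backtrack = torch.softmax(next_token_logits, dim=-1)[backtrack_id].item()
    return 1.0 - p_backtrack
```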
One caveat is that tracking trial-and-error information for highly complex mathematical problems can easily exceed a model’s context length. We sometimes encountered this problem in our experiments when we asked the model to prove very hard theorems. Once it is no longer possible to feed the entire history of proof attempts and backtraces into the model, we need to summarize it. Further research is required to explore efficient methods for this summarization process.
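One simple heuristic, purely for illustration and not the summarization method from our paper, is to keep the current goal verbatim while compressing each failed branch into a one-line note:

```python
# Sketch of one way to keep the proof history inside the context window: keep the
# current goal verbatim but compress each failed branch into a one-line note,
# dropping the oldest notes first if the prompt is still too long.

def compress_history(goal: str, failed_branches: list[list[str]],
                     max_chars: int = 4000) -> str:
    """Build a prompt that summarizes each failed branch by its first tactic."""
    notes = [f"# failed: started with '{branch[0]}' ({len(branch)} steps)"
             for branch in failed_branches if branch]
    tail = [f"GOAL: {goal}", "PROOFSTEP:"]
    prompt = "\n".join(notes + tail)
    while len(prompt) > max_chars and notes:   # drop the oldest notes first
        notes.pop(0)
        prompt = "\n".join(notes + tail)
    return prompt
```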
Going beyond the well-trodden path
Third, we expect a model trained on trial-and-error data to exhibit a strong capacity for thinking “outside the box.” Mathematicians often develop truly creative approaches to solving longstanding problems, producing work that impresses with its ingenuity and provokes curiosity about the thought processes involved.
However, except for a few remarkable cases (like the formulas discovered by Ramanujan), most of these breakthroughs are built on extensive knowledge accumulated over time through trial and error. By identifying existing strategies as ineffective—and uncovering why they are inadequate—mathematicians are compelled to consider novel methods. We believe models can acquire this capability from extensive, high-quality trial-and-error information.
Where do we go from here?
Overall, we are optimistic about the future of automated reasoning. We speculate that mathematical reasoning is not fundamentally different from traditional NLP tasks and that given sufficient high-quality training data, LLMs can reach human-level performance. As we demonstrate in our paper, incorporating trial-and-error information into the training data leads to substantial improvements even with today’s model architectures.
However, as we’ve discussed, the vast majority of current pre-training datasets for mathematical reasoning exhibit significant misalignments with the precise tasks we expect the model to perform. An obvious limitation of our approach is that trial-and-error data is hard to collect from working mathematicians, since tradition and community practice discourage publishing failed attempts. We hope our work can raise the community’s awareness of the importance of trial-and-error data for automated theorem proving.
New state-of-the-art models (such as Meta’s Llama 3 family and OpenAI’s o1 model) that became available after we published our paper have been trained extensively on trial-and-error reasoning data. This has led to significant performance improvements on traditional mathematical benchmarks, such as the MATH dataset. Notably, o1 has the capability to verify its outputs and perform backtracking during inference, informed by previously explored proof searches. We believe this advancement is largely due to the substantial trial-and-error data included in the model’s training process.
Beyond theorem proving, training with trial-and-error data may play a pivotal role in shaping a new “scaling law of inference,” which complements currently known LLM scaling laws. Allowing the model to generate additional tokens, so that it can verify and backtrack on its own output, lets it progressively tackle more complex problems. OpenAI reported this effect for their o1 model as a significant advancement. Furthermore, a recent paper mathematically demonstrates that if a transformer is allowed to generate an arbitrary number of tokens, it has the potential to solve arbitrarily complex problems.
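A schematic of this sample-verify-retry pattern, with generate_proof and check_proof as assumed interfaces rather than any specific model’s API, might look like this:

```python
# Sketch: spending more inference-time compute by verifying each attempt and
# feeding failed attempts back into the prompt before retrying.
# `generate_proof` and `check_proof` are assumed interfaces for illustration.

def prove_with_retries(goal: str, generate_proof, check_proof,
                       max_attempts: int = 8) -> str | None:
    """Return a verified proof, or None after `max_attempts` failures."""
    failed: list[str] = []
    for _ in range(max_attempts):
        prompt = "\n".join(f"FAILED ATTEMPT:\n{p}" for p in failed)
        prompt += f"\nGOAL: {goal}\nPROOF:"
        attempt = generate_proof(prompt)
        if check_proof(goal, attempt):   # an external checker verifies the proof
            return attempt
        failed.append(attempt)           # keep the failure as extra context
    return None
```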
If you’d like to explore this space yourself, we’ve published our dataset and our model weights on Hugging Face, and you can find the source code on GitHub. If you’re interested in how trial-and-error data could be used to improve LLM agents, I recommend the recently published paper Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents, whose dataset is available on Hugging Face as well.
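If you want to try the released artifacts, a typical starting point looks like the snippet below; the repository IDs are placeholders, so substitute the IDs linked from the paper’s Hugging Face page.

```python
# Sketch: loading a released dataset and model from the Hugging Face Hub.
# The repository IDs below are placeholders; replace them with the IDs linked
# from the paper's Hugging Face page.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("your-org/propositional-logic-trial-and-error")  # placeholder ID
model_name = "your-org/learn-from-failure-model"                        # placeholder ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(dataset)   # inspect the available splits and fields
```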