The first field to see a major breakthrough in AI: Mathematics


Focus: ChatGPT applications | Source: Xinzhiyuan

[Introduction] Proving mathematical theorems with large language models: this latest research, led by a Chinese scholar at Caltech, may change the future of mathematics.

Large language models can be used to prove mathematical theorems!

"Mathematical genius" Terence Tao once predicted in a blog post that by 2026, AI combined with search and symbolic math tools would become a trusted co-author in mathematical research.

This prophecy has now come true!

Scholars from Caltech, NVIDIA, MIT, and other institutions have built a theorem prover based on open-source LLMs.

And this paper may change the future of mathematics.


Project address: https://leandojo.org/

Here, the researchers propose an open source platform, LeanDojo, providing toolkits, benchmarks, and models to create an interactive environment for theorem proving for LLM.

Mathematics: The first field to see a major breakthrough in AI


Kaiyu Yang, the paper's first author, said that a formal proof is a computer program whose correctness can be verified.

Most importantly, this study opens a new avenue for addressing LLMs' factuality and hallucination problems: theorem proving is a form of code generation with rigorous evaluation, leaving simply no room for the model to hallucinate.

NVIDIA Chief Scientist Jim Fan excitedly retweeted: the first discipline to witness a major AI breakthrough is likely to be mathematics!

He said everyone should read mathematician Terence Tao's blog post, in which Tao predicts that by 2026, AI combined with search and symbolic math tools will become a trusted co-author in mathematical research.


Why would AI's first major breakthrough come in mathematics? The reasons are as follows:

- Math can be conveniently expressed as coding problems

- Can be rigorously verified by a theorem prover like Lean instead of relying on empirical results

- Unlike biology or medicine, it requires no physical experiments, and it does not depend on robotics, which has yet to mature

GPT is good at coding, and Lean is a programming language for formalized mathematics, where hallucinations cannot slip through verification.

AI math copilots are here. A fully automated AI mathematician that discovers new theorems is next!


Some netizens quipped: so Terence Tao can be fired and easily replaced now?


How strong is LeanDojo?

LeanDojo: an interactive environment for theorem proving


Machine learning, especially large language models, shows great promise for proving formal theorems using the proof assistant Lean.

LeanDojo's key features include:

- Provides tools for data extraction and interaction with Lean 

- Fine-grained annotation of premises (existing theorems) in proofs: where these premises are used and where they are defined 

- LeanDojo Benchmark: 97,000 human-written theorems/proofs for developing machine learning models for theorem proving

- ReProver (Retrieval-Augmented Prover): the first LLM-based prover augmented with retrieval for premise selection

Lean is a very popular proof assistant tool among mathematicians.

The research team built LeanDojo on top of Lean: it extracts human-written proofs from Lean to form a dataset.

Thus, by interacting with Lean's proof environment, the trained model can be used to prove theorems.

The workflow and principle of LeanDojo are roughly shown in the following figure:


Top right: LeanDojo extracts proofs from Lean into the database for training machine learning models.

The same machinery also lets a trained model prove theorems by interacting with Lean's proof environment.

Top left: the proof tree of a Lean theorem; here gcd means the greatest common divisor.

When proving a theorem, we start from the original theorem as the initial state (the root) and repeatedly apply tactics (edges) to decompose states into simpler sub-states, until all states are solved (the leaf nodes).

Tactics may rely on premises, such as mod_self and gcd_zero_left, defined in a large math library.

For example, mod_self is an existing theorem used in the proof to simplify the goal.
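Concretely, a proof in this style might look like the following Lean 3 sketch; the exact proof in mathlib may differ, but mod_self and gcd_zero_left are the premises mentioned above:

```lean
-- Sketch in Lean 3 with mathlib-style names (illustrative, not necessarily
-- the proof as it appears in mathlib).
example (n : ℕ) : nat.gcd n n = n :=
begin
  cases n,
  { refl },                     -- base case: gcd 0 0 = 0
  { unfold nat.gcd,             -- unfold one step of the gcd recursion
    rw nat.mod_self,            -- premise mod_self: n % n = 0
    apply nat.gcd_zero_left },  -- premise gcd_zero_left: gcd 0 n = n
end
```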

Bottom: given a state, the ReProver model retrieves premises from the math library, concatenates them with the state, and feeds them into an encoder-decoder Transformer that generates the next tactic.
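The retrieval step can be sketched in plain Python. This is a toy stand-in, not ReProver's actual code: a bag-of-words similarity replaces the learned retriever, and the proof state and premise library below are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for ReProver's learned retriever)."""
    return Counter(text.replace("(", " ").replace(")", " ").split())

def cosine(u, v):
    """Cosine similarity between two token-count vectors."""
    dot = sum(c * v[w] for w, c in u.items())
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(state, premises, k=2):
    """Rank library premises by similarity to the proof state and keep the top k."""
    sv = embed(state)
    return sorted(premises, key=lambda p: cosine(sv, embed(p)), reverse=True)[:k]

# Hypothetical proof state and a tiny premise library, for illustration only.
state = "⊢ gcd (succ n % succ n) (succ n) = succ n"
premises = [
    "mod_self : ∀ n, n % n = 0",
    "gcd_zero_left : ∀ n, gcd 0 n = n",
    "add_comm : ∀ a b, a + b = b + a",
]
retrieved = retrieve(state, premises)
# The retrieved premises are concatenated with the state and fed to an
# encoder-decoder Transformer, which generates the next tactic.
model_input = "\n".join(retrieved + [state])
```

In the real system the encoder produces dense embeddings and retrieval is done over the whole math library, but the concatenate-then-generate shape of the pipeline is the same.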

Benchmarks


- LeanDojo Benchmark: 96,962 theorems/proofs, 212,787 tactics, and 128,163 premises extracted from mathlib.

- LeanDojo Benchmark 4: 91,766 theorems/proofs and 177,349 tactics extracted from mathlib4. Premise information will be available shortly.

LeanDojo can extract data from any Lean repository on GitHub (both Lean 3 and Lean 4 are supported).

This data contains rich information not directly visible in the raw Lean code, including file dependencies, abstract syntax trees (ASTs), proof states, tactics, and premises.
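To make the shape of this data concrete, here is a hypothetical sketch of such a record in Python; the class and field names are illustrative, not LeanDojo's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative record types for the kind of data extracted per theorem:
# each proof becomes a sequence of (state, tactic, next state) steps,
# annotated with the premises each tactic uses.
@dataclass
class TacticStep:
    state_before: str
    tactic: str
    state_after: str
    premises_used: list

@dataclass
class TracedTheorem:
    file_path: str
    full_name: str
    steps: list = field(default_factory=list)

thm = TracedTheorem(
    file_path="src/data/nat/gcd.lean",   # hypothetical path
    full_name="nat.gcd_self",
)
thm.steps.append(TacticStep(
    state_before="n : ℕ ⊢ gcd n n = n",
    tactic="cases n",
    state_after="⊢ gcd 0 0 = 0; ⊢ gcd (succ n) (succ n) = succ n",
    premises_used=[],
))
```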

Key feature 1: premise information

The LeanDojo Benchmark includes fine-grained annotations of premises (where they are used in proofs and where they are defined in libraries), providing valuable data for premise selection, a key bottleneck in theorem proving.

Key feature 2: challenging data splits

Randomly splitting theorems into train/test sets leads to overestimated model performance: large language models can prove seemingly difficult theorems by memorizing the proofs of similar theorems seen during training.

The researchers alleviate this problem by designing challenging data splits that require the model to generalize to theorems relying on novel premises that were never used in training.
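The idea behind such a split can be sketched as follows; this illustrates the principle only and is not LeanDojo's actual implementation:

```python
# A theorem goes into the challenging test set only if its proof uses at
# least one premise that no training proof uses.
def novel_premise_split(theorems, train_fraction=0.8):
    """theorems: list of (name, set_of_premises_used) pairs."""
    n_train = int(len(theorems) * train_fraction)
    train = theorems[:n_train]
    seen = {p for _, premises in train for p in premises}
    test = [t for t in theorems[n_train:]
            if any(p not in seen for p in t[1])]
    return train, test

# Hypothetical theorems and premise sets for illustration.
theorems = [
    ("thm_a", {"mod_self"}),
    ("thm_b", {"gcd_zero_left"}),
    ("thm_c", {"mod_self", "gcd_zero_left"}),  # only premises seen in training
    ("thm_d", {"add_comm"}),                   # relies on a novel premise
]
train, test = novel_premise_split(theorems, train_fraction=0.5)
# thm_c is dropped from the test set because memorizing training proofs
# could solve it; thm_d survives because add_comm never appears in training.
```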

Interact with Lean


LeanDojo turns Lean into a gym-like environment in which a prover can observe the proof state, run tactics to change the state, and receive feedback on errors or on proof completion.

Such an environment is essential for evaluating/deploying provers or training via reinforcement learning.
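The interaction loop can be illustrated with a toy environment in Python; ToyLeanEnv and its hard-coded state transitions are hypothetical stand-ins for LeanDojo's real interface:

```python
# Toy observe/act/feedback loop. A real environment would run Lean itself;
# here the proof succeeds once the two steps are applied in order.
class ToyLeanEnv:
    def __init__(self):
        self.state = "⊢ gcd 0 n = n ∨ gcd n n = n"  # hypothetical initial state
        self.state = "⊢ gcd n n = n"

    def run_tac(self, tactic):
        """Apply a tactic; return (new_state, feedback)."""
        if self.state == "⊢ gcd n n = n" and tactic == "rw mod_self":
            self.state = "⊢ gcd 0 n = n"
            return self.state, "proof still open"
        if self.state == "⊢ gcd 0 n = n" and tactic == "apply gcd_zero_left":
            self.state = None
            return None, "no goals"
        return self.state, "tactic failed"

env = ToyLeanEnv()
state, feedback = env.run_tac("rw mod_self")        # changes the state
state, feedback = env.run_tac("apply gcd_zero_left")  # closes the goal
```

A prover (or a reinforcement-learning agent) plugs into exactly this loop: propose a tactic, observe the new state or error, repeat until "no goals".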

Experimental evaluation


The researchers used the LeanDojo Benchmark to train and evaluate ReProver.

The figure below shows the percentage of theorems proved in 10 minutes. Each column represents a different data split.

ReProver outperforms Lean's built-in proof automation tactic (tidy), as well as a baseline that generates tactics directly without retrieval.

Another baseline the researchers compared against uses GPT-4 to generate tactics in a zero-shot fashion.


New proofs discovered & formalization errors found

The researchers used the theorems in miniF2F and ProofNet to evaluate ReProver.

They found 33 proofs in miniF2F and 39 proofs in ProofNet that did not previously exist in Lean.

They also found multiple errors in ProofNet's formalized theorem statements.

For details, see: https://github.com/zhangir-azerbayev/ProofNet/pull/14

ChatGPT plugin

The researchers also built a ChatGPT plugin for LeanDojo, enabling ChatGPT to prove theorems by interacting with Lean.

Specifically, they tried three mathematical formulas: a+b+c=a+c+b, Stirling's formula, and Gauss's summation formula.

They found that, compared with the specialized theorem-proving LLM (ReProver), ChatGPT can interleave informal mathematics with formal proof steps, similar to how humans interact with proof assistants.

It can even explain Lean's error messages, and it is easier to steer (via prompt engineering) than specialized provers.

However, it is difficult to find the correct proof in most cases due to weaknesses in search and planning.

The specific demonstration is as follows:

a+b+c=a+c+b
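For this first formula, one plausible Lean 3 proof looks like the following; this is a sketch, not necessarily ChatGPT's actual transcript, and with mathlib `by ring` would also close the goal:

```lean
example (a b c : ℕ) : a + b + c = a + c + b :=
-- reassociate, swap b and c, reassociate back
by rw [add_assoc, add_comm b c, ← add_assoc]
```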

Stirling's formula

Gauss' summation formula

On GitHub, the developers give an example of how to use the demo:

After the plugin is installed, you can ask ChatGPT to prove a theorem by telling it the theorem's name and where it is defined. For example:

I want you to prove a theorem in Lean. The theorem's name is `hello_world`, and it is defined in the file `src/example.lean` in `https://github.com/yangky11/lean-example`. Please explain the theorem to me, lay out a high-level proof plan, and then try various tactics to prove the theorem.

Initializing the proof search may take some time.

You can use prompts to control ChatGPT's behavior. For example, you can ask it to lay out a high-level proof plan before attempting any tactics.

Reactions


This work is among the best applications of AI in mathematics so far, and it identifies a very realistic way for AI to contribute to mathematical research.


We are one step closer to the grand goal of formally verifying all of mathematics!


Mathematical proofs are really tailor-made tasks for large language models, because the validity of the results can be fully guaranteed.


Beyond praising how this project accelerates mathematics research, netizens let their imaginations run wild about future possibilities.


One netizen tagged Elon Musk: the rapid development of mathematics will let humanity enter a world that, so far, exists only in science fiction.


Because mathematics is the mother of science, the rapid development of mathematics will lead to continuous acceleration of all natural sciences.

It does make sense that mathematics would be the first scientific discipline to see a major breakthrough in artificial intelligence.


Paper authors

Kaiyu Yang


Kaiyu Yang is a postdoctoral fellow in the Department of Computing + Mathematical Sciences (CMS) at Caltech, supervised by Anima Anandkumar. He received his Ph.D. from Princeton University under the supervision of Jia Deng, and also worked with Olga Russakovsky and Danqi Chen.

His research focuses on neuro-symbolic artificial intelligence, which aims to enable machine learning to perform symbolic reasoning.

Kaiyu Yang pursues this goal from two directions: (1) applying machine learning to symbolic reasoning tasks, such as mathematical reasoning and theorem proving in formal logic or natural language; and (2) introducing symbolic components into machine learning models to make them more interpretable, verifiable, and data-efficient.

Currently, he is working on AI that can understand and reason about mathematics. Mathematical reasoning is a major milestone of human intelligence and has the potential to transform many important problems in science and engineering, such as solving partial differential equations and formal verification.

Alex Gu


Alex Gu is a PhD student at MIT advised by Armando Solar-Lezama. He also earned his BS and MS degrees from MIT, advised by Armando Solar-Lezama and Jacob Andreas.

Alex Gu has interned at Meta AI Research, Jane Street and pony.ai.

Peiyang Song


Peiyang Song is an Honors B.S. candidate in Computer Science in the College of Creative Studies (CCS) at the University of California, Santa Barbara (UCSB).

His research interests span machine learning and its applications in natural language processing and computer vision, as well as its intersections with computer architecture, programming languages, and more.

His recent research falls mainly in two directions: 1) neural theorem proving and automated reasoning, combining large language models (LLMs) with interactive theorem provers (ITPs); 2) temporal logic for energy-efficient machine learning inference.

References:

https://leandojo.org/

https://twitter.com/KaiyuYang4/status/1673882824158613504

https://twitter.com/DrJimFan/status/1674083328478318594




Origin blog.csdn.net/Datawhale/article/details/131507452