A Survey of WebGLM and Related Work

New work from Tang Jie's group at Tsinghua: WebGLM, a 10-billion-parameter model built around live web search, with performance surpassing OpenAI's WebGPT.
GitHub repository: https://github.com/THUDM/WebGLM
Recordings of the forum talks from WAIC last Friday: two are linked below for reference, and the remaining talks will be uploaded to Bilibili over time. A prompt-engineering course focused on practical, real-world use will also be posted on Bilibili soon.

Model address: https://huggingface.co/THUDM/WebGLM
[Talk] Exploring the path to ChatGLM
https://www.bilibili.com/video/BV1cm4y1E7uV

[Talk] WebGLM: a retrieval-augmented large pre-trained model
https://www.bilibili.com/video/BV1f94y1q7pU/
Atlas: the ultimate goal of a retrieval-augmented model is for the model not merely to memorize data, but to learn to find it. This property is a major advantage on many knowledge-intensive tasks, where retrieval-augmented models have already seen great success; whether retrieval augmentation also suits few-shot learning, however, was unknown. This Meta AI paper tests retrieval augmentation in the few-shot setting, and Atlas is the result. https://zhuanlan.zhihu.com/p/564646449
Atlas has two sub-models: a retriever and a language model. Given a task, Atlas uses the retriever to select the top-k most relevant documents from a large corpus according to the input question, then feeds those documents, together with the query, into the language model to generate the required output.
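To make the two-stage flow concrete, here is a minimal, self-contained sketch of the retrieve-then-generate loop. The word-overlap retriever and the stub generator are illustrative stand-ins, not Atlas's actual components:

```python
# Toy sketch of the Atlas-style retrieve-then-generate loop.

def retrieve_top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank documents by naive word overlap with the query (stand-in retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(query: str, docs: list[str]) -> str:
    """Stand-in generator: a real system feeds query + docs to a seq2seq LM."""
    context = " ".join(docs)
    return f"[answer conditioned on {len(docs)} docs, {len(context.split())} context words]"

corpus = [
    "Atlas couples a dense retriever with a T5 language model.",
    "Contriever encodes queries and documents independently.",
    "Fusion-in-Decoder attends over all retrieved passages.",
]
docs = retrieve_top_k("How does Atlas use its retriever?", corpus)
print(generate("How does Atlas use its retriever?", docs))
```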
The basic training strategy of Atlas is to train the retriever and the language model jointly with the same loss function. Both are built on pre-trained Transformer networks, as described below.

The retriever is based on Contriever, which is pre-trained on unsupervised data and uses a dual-encoder (bi-encoder) design: the query and the document are encoded independently, and their relevance is the dot product of the two output embeddings. This design lets Atlas train the retriever without document-level relevance annotations and significantly reduces memory requirements.
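A minimal sketch of this bi-encoder scoring, assuming toy dimensions and randomly initialized weights rather than Contriever's real checkpoint. Because queries and documents are encoded independently, document vectors can be precomputed offline, and relevance reduces to a dot product:

```python
import torch
import torch.nn as nn

class BiEncoder(nn.Module):
    """Shared encoder applied independently to queries and documents."""
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings into one vector per sequence.
        return self.embed(token_ids).mean(dim=1)

encoder = BiEncoder()
query = torch.randint(0, 1000, (1, 8))     # one query, 8 tokens
docs = torch.randint(0, 1000, (5, 32))     # five documents, 32 tokens each
scores = encoder(query) @ encoder(docs).T  # (1, 5) dot-product relevance scores
print(scores.topk(k=2).indices)            # indices of the top-2 documents
```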
The language model is based on T5. Each retrieved document is concatenated with the query and processed independently by the encoder; the decoder then applies cross-attention over the concatenation of all retrieved passages to produce the final output. This Fusion-in-Decoder approach helps Atlas scale efficiently as the number of retrieved documents grows.
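A toy sketch of the Fusion-in-Decoder wiring, using small randomly initialized Transformer layers in place of T5. Each query+passage pair is encoded on its own; only the decoder sees all passages at once:

```python
import torch
import torch.nn as nn

dim, n_docs, src_len = 64, 4, 16

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=1)

# Each "query: ... context: doc_i" pair is encoded independently ...
pairs = torch.randn(n_docs, src_len, dim)           # stand-in pair embeddings
encoded = encoder(pairs)                            # (n_docs, src_len, dim)

# ... then all encoder states are concatenated along the sequence axis,
# so the decoder's cross-attention sees every retrieved passage at once.
memory = encoded.reshape(1, n_docs * src_len, dim)
target = torch.randn(1, 10, dim)                    # decoder input embeddings
out = decoder(target, memory)                       # cross-attends over all passages
print(out.shape)                                    # torch.Size([1, 10, 64])
```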
Notably, the authors compared four loss functions, as well as the case of not jointly training the retriever and the language model. The results are as follows:
[Figure: few-shot accuracy for the four joint-training losses vs. no joint training]

In the few-shot setting, the accuracy obtained with joint training is significantly higher than without it. The authors therefore conclude that joint training of the retriever and the language model is the key to Atlas's few-shot learning ability.
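As one concrete example of such a joint objective, here is a minimal sketch of perplexity distillation, one of the four losses the paper compares: the retriever's distribution over retrieved documents is pushed, via a KL term, toward how much each document improves the language model's likelihood of the answer. All scores below are random stand-ins for real retriever and LM outputs:

```python
import torch
import torch.nn.functional as F

retriever_scores = torch.randn(8, requires_grad=True)  # s(q, d_i) for 8 docs
lm_answer_logprob = torch.randn(8)                     # log p_LM(answer | q, d_i)

p_retr = F.log_softmax(retriever_scores, dim=0)        # retriever's doc distribution
p_target = F.softmax(lm_answer_logprob, dim=0)         # LM-derived target distribution

# KL(target || retriever): documents that help the LM get higher retriever mass.
loss = F.kl_div(p_retr, p_target, reduction="sum")
loss.backward()                                        # gradient reaches the retriever only
print(loss.item())
```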

  1. Experimental results
    On the Massive Multitask Language Understanding benchmark (MMLU), Atlas, with only 11B parameters, achieves higher accuracy than GPT-3, which has roughly 15 times as many parameters. After multi-task training, its 5-shot accuracy even approaches that of Gopher, which has about 25 times as many parameters as Atlas.
    [Figure: MMLU accuracy comparison]

On the two open-domain question-answering datasets, NaturalQuestions and TriviaQA, Atlas is compared with other models in both the 64-shot setting and the full-training-set setting. As shown in the figure below, Atlas sets a new 64-shot SOTA, reaching 84.7% accuracy on TriviaQA with only 64 training examples.

[Figure: 64-shot and full-data results on NaturalQuestions and TriviaQA]

On the fact-checking task FEVER, Atlas's few-shot performance is also significantly better than that of ProoFVer and of Gopher, the latter of which has dozens of times more parameters than Atlas. In the 15-shot setting, Atlas exceeds Gopher by 5.1 points.
[Figure: FEVER few-shot results]

On KILT, the authors' own benchmark of knowledge-intensive NLP tasks, Atlas trained with only 64 examples approaches, on some tasks, the accuracy other models obtain with the full training data. After training on the full data, Atlas set a new SOTA on five datasets.
[Figure: KILT results]

  2. Interpretability, controllability, and updatability
    According to this paper, retrieval-augmented models are not only smaller and stronger; they also offer an interpretability advantage that other large models lack. The black-box nature of large models makes it hard for researchers to analyze how they operate, whereas a retrieval-augmented model exposes the documents it retrieves, so analyzing which articles the retriever surfaces gives a much better understanding of how Atlas works.

For example, the paper found that in the domain of abstract algebra, 73% of the documents the model retrieves come from Wikipedia, while in the domain of ethics only 3% do, which matches human intuition. As the left-hand chart in the figure below shows, although the model generally prefers CCNet data, Wikipedia usage rises markedly in STEM domains, which rely more heavily on formulas and reasoning.
[Figure: left, share of retrieved documents by source per domain; right, accuracy vs. number of retrieved articles containing the answer]

From the right-hand chart in the same figure, the authors found that model accuracy keeps rising as more of the retrieved articles contain the correct answer: when none of the articles contain the answer, accuracy is only 55%, but when the answer is mentioned more than 15 times, accuracy reaches 77%. In addition, a manual inspection of 50 retrieved documents found that 44% contained useful background information; such material gives readers a valuable starting point for further reading.

We generally assume that large models carry a risk of training-data "leakage": a model sometimes answers test questions not through learned ability but through memorization, because the answers appeared somewhere in its training corpus. In this paper, after the authors manually removed potentially leaked corpus content, the model's accuracy dropped only from 56.4% to 55.8%, a mere 0.6 points. Retrieval augmentation can thus effectively avoid the risk of the model "cheating".

Finally, updatability is a distinctive advantage of retrieval-augmented models: they can be kept current without retraining, simply by updating or replacing the corpus they rely on. Using a time-stratified dataset, as shown in the figure below, the authors show that without updating Atlas's parameters at all, Atlas achieves 53.1% accuracy using only the 2020 corpus. Interestingly, even after fine-tuning T5 on 2020 data, T5 still performs poorly; the authors attribute this largely to the fact that T5's pre-training data predates 2020.

[Figure: temporal-update results, Atlas with a swapped corpus vs. fine-tuned T5]
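A small sketch of what this "update without retraining" amounts to in practice, with a random embedder standing in for the frozen retriever encoder: a knowledge update only rebuilds the document index; no parameter of the retriever or the language model changes.

```python
import numpy as np

class FrozenRetrieverIndex:
    """Document index built with a frozen encoder; updates never touch weights."""
    def __init__(self, dim: int = 64):
        self.dim = dim
        self.docs: list[str] = []
        self.vectors = np.empty((0, dim))

    def _embed(self, texts: list[str]) -> np.ndarray:
        # Stand-in for the frozen retriever encoder (random vectors here).
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(texts), self.dim))

    def rebuild(self, corpus: list[str]) -> None:
        """Swap in a fresh corpus: the entire 'knowledge update'."""
        self.docs = corpus
        self.vectors = self._embed(corpus)

    def search(self, query: str, k: int = 1) -> list[str]:
        qv = self._embed([query])[0]
        top = np.argsort(self.vectors @ qv)[::-1][:k]
        return [self.docs[i] for i in top]

index = FrozenRetrieverIndex()
index.rebuild(["Wikipedia dump, Dec 2017 ..."])  # original knowledge
index.rebuild(["Wikipedia dump, Dec 2020 ..."])  # update: re-index only
print(index.search("latest facts"))
```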

  3. Conclusion
    We can imagine three students. One can solve problems only by rote memorization. One memorizes the answer to every math problem and recites them one by one. The last is naturally gifted: after briefly studying the textbook, he can walk into the exam room with confidence.

Clearly, the ideal of few-shot learning is to become the third student, but in reality we are likely to remain stuck at the first. Large models are convenient, but "big" should never be a model's ultimate goal. Returning to the original aim of few-shot learning, namely giving models a human-like ability to reason, judge, and generalize, this paper takes a step forward from a different angle. At the very least, it spares the student from cramming his head with so much possibly redundant knowledge; he can pick up a textbook and go into battle light. Perhaps letting the student bring the textbook into an open-book exam is closer to intelligence than having him memorize everything by rote!

WebGLM retriever: a search-engine-based retriever, with a scorer trained without manual annotation.

A large model is used to generate the training labels, and the large model's labels are 90.2% accurate.
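A sketch of how such LLM-produced "silver" labels can replace human annotation when training a scorer. `llm_judge` is a hypothetical stand-in (a trivial word-overlap heuristic) for a real LLM API call; WebGLM's actual bootstrapping pipeline differs in detail.

```python
def llm_judge(question: str, passage: str) -> int:
    """Stand-in: ask an LLM whether `passage` supports answering `question`."""
    return int(any(w in passage.lower() for w in question.lower().split()))

def build_silver_dataset(questions: list[str], passages: list[str]):
    # Every (question, passage) pair gets a cheap LLM-produced label;
    # the resulting "silver" data is what trains the dense scorer.
    return [(q, p, llm_judge(q, p)) for q in questions for p in passages]

data = build_silver_dataset(
    ["what is fusion-in-decoder"],
    ["Fusion-in-Decoder concatenates encoder states.", "Unrelated text."],
)
print(data)
```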

WebGLM also proposes a set of metrics for evaluating long-form question answering with citations.


Source: https://blog.csdn.net/stay_foolish12/article/details/131701513