Reciting is not understanding: an in-depth analysis of knowledge storage and extraction in large models

Memorization by a language model does not equal understanding. Even if a model memorizes all of its training data perfectly, it may still be unable to extract that knowledge through fine-tuning and answer simple questions.

As models grow larger, people have begun to ask how large models manage to master so much knowledge. One view attributes this to "lossless compression": through extensive training, the model memorizes more content in order to improve its prediction accuracy. But does "lossless compression" really allow a large model to understand that knowledge? The recent paper "Physics of Language Models: Part 3.1, Knowledge Storage and Extraction" by Zeyuan Allen-Zhu (Meta AI) and Yuanzhi Li (MBZUAI) explores this question in depth.

Paper address: https://arxiv.org/pdf/2309.14316.pdf

There is a saying about humans: "read a book a hundred times and its meaning reveals itself." This does not hold for all knowledge, but for simple knowledge, as long as we remember the relevant text we can easily answer related questions. For example, once we remember the classical poem "Silent Night Thoughts," we can easily answer "What is the moonlight compared to in the poem?"; once we remember the "Chu Shi Biao / Creative Background" paragraph from Baidu Encyclopedia, we can easily answer "When was 'Chu Shi Biao' written?". So, can large models do the same?


Figure 1: Examples of knowledge extraction with GPT-4 (left: ChatGPT; right: the API)

GPT-4 can understand and repeat the paragraphs related to the question, so why can it not answer these simple questions the way a human can? Is the model not large enough, is its memory insufficient, or was it not fine-tuned enough after training? None of the above! The paper shows that even a language model that is large enough, trained long enough, and fine-tuned thoroughly may still fail to answer questions that humans find simple. The underlying reason lies in how the knowledge is presented in the pre-training data: the same piece of knowledge must appear many times in the pre-training set, and with enough "diversity," before it can be easily extracted after fine-tuning.

To verify this, the two authors built a dataset of 100k biographies. Each person has a biography entry containing the person's name and six fixed attributes: birth date, birth place, university, major, employer, and work location. They designed two datasets, BioS and BioR: every sentence in BioS is drawn from one of 50 fixed templates, while BioR is rewritten by LLaMA-30B and is therefore more realistic and diverse. The results on the two datasets are consistent. Taking BioS as an example, a sample entry is shown below:

Anya Briar Forger was born on October 2, 1996. She spent her early years in Princeton, NJ. She received mentorship and guidance from faculty members at MIT. She completed her education with a focus on Communications. She had a professional role at Meta Platforms. She was employed in Menlo Park, CA.
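
To make the setup concrete, here is a minimal sketch of how a BioS-style entry could be assembled from fixed sentence templates. It is illustrative only: the attribute keys, template wordings, and the one-template-per-attribute pool are our own placeholders, not the authors' generation code.

```python
import random

# One synthetic person's attribute record (values taken from the sample entry above).
person = {
    "name": "Anya Briar Forger",
    "birth_date": "October 2, 1996",
    "birth_place": "Princeton, NJ",
    "university": "MIT",
    "major": "Communications",
    "employer": "Meta Platforms",
    "work_location": "Menlo Park, CA",
}

# Hypothetical template pool; bioS draws each sentence from a pool of ~50 fixed templates.
TEMPLATES = {
    "birth_date":    ["{name} was born on {birth_date}."],
    "birth_place":   ["{name} spent her early years in {birth_place}."],
    "university":    ["{name} received mentorship and guidance from faculty members at {university}."],
    "major":         ["{name} completed her education with a focus on {major}."],
    "employer":      ["{name} had a professional role at {employer}."],
    "work_location": ["{name} was employed in {work_location}."],
}

def make_bios_entry(person: dict) -> str:
    """Fill one randomly chosen template per attribute and join them into an entry."""
    sentences = [random.choice(TEMPLATES[attr]).format(**person) for attr in TEMPLATES]
    return " ".join(sentences)

print(make_bios_entry(person))
```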

Figure 2

Even a language model that is perfectly pre-trained on the 100k biographies cannot accurately answer the question "Which university did Anya attend?" after QA fine-tuning. As shown in Figure 2, even when 50k of the people are used as QA fine-tuning training data and various fine-tuning methods (including LoRA) are tried, the model's accuracy on the remaining 50k people is only about 10%. The authors used a 682M-parameter model (roughly 7,000 parameters per person), trained it for 1,350 passes over the data, and even mixed in standard NLP pre-training data such as WikiBook, yet the accuracy did not improve. Brute force did not work a miracle here.
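
For readers who want a sense of what such QA fine-tuning looks like in practice, below is a rough LoRA sketch using Hugging Face PEFT. The base model, hyperparameters, and QA prompt format are placeholders rather than the paper's configuration (the paper pre-trains its own models on the biography corpus), and in practice the loss is usually computed only on the answer tokens rather than the whole sequence.

```python
# Minimal LoRA fine-tuning sketch (illustrative; not the paper's exact setup).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "gpt2"  # placeholder; the paper trains its own GPT-style models from scratch
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with LoRA adapters so only a small set of parameters is trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# One QA pair from the fine-tuning split; for simplicity the loss here covers all tokens.
example = "Question: Which university did Anya Briar Forger attend? Answer: MIT"
batch = tokenizer(example, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # an optimizer step over the LoRA parameters would follow
```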

Therefore, "lossless compression" alone does not guarantee that a large model can extract the knowledge it has memorized. How, then, does GPT-4 master knowledge? To study this question, the two authors modified the pre-training set in a procedure they call knowledge augmentation:

1. Diversity (multiM): create M biography entries for each person, written in different narrative styles but preserving the same information (there are about 100 phrasings for each sentence, and each sentence of each biography picks one of them)

2. Random permutation (permute): randomly shuffle the order of the biography sentences

3. Full name (fullname): replace all pronouns, surnames, and given names in the biography with the person's full name

The authors call the original dataset bioS single and experiment with 15 combinations of these knowledge augmentations. For example, bioS multi5+permute means that each person has 5 biographies and the sentence order is shuffled. Here is an example of a bioS multi5+permute entry (a code sketch of the augmentations follows the example):

Anya Briar Forger originated from Princeton, NJ. She dedicated her studies to Communications. She gained work experience in Menlo Park, CA. She developed her career at Meta Platforms. She came into this world on October 2, 1996. She pursued advanced coursework at MIT.
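
The three augmentations can be sketched as follows. The template pool and helper function are hypothetical and kept tiny for illustration; the real pool contains on the order of 100 phrasings per sentence.

```python
import random

# Hypothetical template pool: several phrasings per attribute; only three attributes shown.
TEMPLATES = {
    "birth_place": ["{subject} originated from {value}.",
                    "{subject} spent her early years in {value}."],
    "major":       ["{subject} dedicated her studies to {value}.",
                    "{subject} completed her education with a focus on {value}."],
    "university":  ["{subject} pursued advanced coursework at {value}.",
                    "{subject} received mentorship and guidance from faculty members at {value}."],
}

def augment(person: dict, m: int = 5, permute: bool = True, fullname: bool = True) -> list:
    """Generate m knowledge-augmented biography entries for one person (illustrative)."""
    entries = []
    for _ in range(m):                                          # multiM: m entries per person
        sentences = []
        for attr, pool in TEMPLATES.items():
            subject = person["name"] if fullname else "She"     # fullname: avoid pronouns
            sentences.append(random.choice(pool).format(subject=subject, value=person[attr]))
        if permute:
            random.shuffle(sentences)                           # permute: shuffle sentence order
        entries.append(" ".join(sentences))
    return entries

person = {"name": "Anya Briar Forger", "birth_place": "Princeton, NJ",
          "major": "Communications", "university": "MIT"}
for entry in augment(person, m=2):
    print(entry)
```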

For both humans and large models, memorizing bioS single and memorizing bioS multi5+permute are almost equally difficult (they carry the same amount of information, and every sentence is drawn from a fixed pool of templates). So if we pre-train on this new knowledge-augmented dataset and then fine-tune on QA, does the extraction performance change?

Figure 3

Figure 3 shows that the QA accuracy of the model pre-trained on bioS single is only 9.7%, while the model pre-trained on bioS multi5+permute reaches 96.6%. This dramatic improvement has nothing to do with the fine-tuning method, the model size, or the training time; it depends on how the knowledge is presented during pre-training, that is, on how the knowledge is "recited" by the large model.

The study also finds that if the biographies are split into a "celebrity" group and a "minority" group, then augmenting only the celebrity biographies already greatly improves the model's knowledge extraction on the minority group, even though the minority data itself receives no augmentation. Of course, the best results still require augmenting all of the data.


Figure 4: Simply increasing the diversity of the celebrity training data makes the accuracy of knowledge extraction for the minority group soar.

So why does the model's question-answering ability differ so much after reciting different data? And why does repeatedly reciting celebrity biographies improve knowledge extraction for the minority group? The reason is that the model ends up memorizing the knowledge in different ways.

The authors investigate how the model memorizes knowledge using two kinds of linear probing. Here we look at one of them, called P-probing.

In P-probing, a biography entry is fed into the pre-trained model and a linear classifier is trained to predict the six target attributes (university, major, and so on) from the hidden states. The goal is to see whether the model can extract this information at positions earlier than where the attribute actually appears in the text. If the classifier achieves high accuracy for "employer" immediately after the person's name, it means the model has directly learned "Anya's employer is Meta." If high accuracy is reached only near the end of the biography, the model may be using a flawed memorization scheme, such as "the person whose birthday is October 2, 1996 and whose university is MIT has employer Meta."

The experimental design of P-probing is as follows: find the position in each biography where each of the 6 attributes first appears, and at the position immediately preceding each of those occurrences, train linear classifiers to predict each of the 6 target attributes. This yields 6 × 6 = 36 classification tasks.
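
A rough sketch of this probing procedure is shown below. The model name, tokenizer, and position bookkeeping are placeholders, and the probe classifier (logistic regression on last-layer hidden states) is our assumption for illustration rather than the paper's exact probe.

```python
# Illustrative P-probing sketch: train a linear probe on the hidden state taken just
# before the position where an attribute first appears in the biography text.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder for the model pre-trained on the biography corpus
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def hidden_state_before(text: str, attr_char_pos: int) -> torch.Tensor:
    """Hidden state of the token immediately preceding the attribute's first character."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, hidden_dim)
    attr_token = next(i for i, (start, _) in enumerate(offsets) if start >= attr_char_pos)
    return hidden[max(attr_token - 1, 0)]

# `examples` would hold (biography_text, char_pos_of_attribute, attribute_label) triples.
# features = torch.stack([hidden_state_before(t, p) for t, p, _ in examples]).numpy()
# labels = [y for _, _, y in examples]
# probe = LogisticRegression(max_iter=1000).fit(features, labels)  # one of the 36 probes
```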


Figure 5: P-probing results show that knowledge augmentation of the pre-training data causes knowledge to be stored at earlier positions, some of it directly on the person's name. Whether the model can answer questions after fine-tuning is tied to whether the information was stored directly on the person's name during pre-training (compare Figure 3 with Figure 5).

The P-probing results show that during pre-training the language model can memorize information by attaching it to the person's name, achieving compression, but it can also memorize it via other attributes (for example, "the employer of the person who studied at MIT and was born on October 2, 1996 is ..."). The second scheme feels "unnatural" to humans, yet both schemes give the model the same compression ratio. If the model memorizes information the second way, it cannot answer the questions after fine-tuning. With knowledge augmentation, the pre-trained model gradually learns to prefer the first scheme.

One might argue that the "knowledge extraction" failures above are due to the unidirectional nature of autoregressive language models such as GPT. In fact, bidirectional language models such as BERT are even worse at knowledge extraction: they can store multi-token knowledge such as "Meta Platforms" but cannot extract it. Interested readers can refer to Section 6 of the paper.

In general, whether a language model can answer a "knowledge extraction" question depends not only on "lossless compression" but also on "how the knowledge is compressed inside the model." The paper emphasizes that key but rarely occurring data should receive knowledge augmentation during pre-training (for example, by rewriting it multiple times with ChatGPT). Without this step, no matter how hard you work on fine-tuning, the pre-trained model may have losslessly compressed the training data and yet still be unable to extract the knowledge!

Conclusion

How can we understand how language models work? Most researchers speculate about their capabilities by chatting with models such as GPT-4. The authors of the "Physics of Language Models" series instead propose a more precise approach: using carefully designed training data and controlled experiments to probe the internal mechanisms of the Transformer and explain its ability to handle AI tasks.

In "Part 3.1: Knowledge Storage and Extraction", the author accurately tested the model's response to different data and found the accurate relationship between the model's learning knowledge and ability and the training data.

They have also released "Part 3.2: Knowledge Manipulation," which further studies how the model manipulates the knowledge it has stored in specific situations. For example, if a large model has memorized "Silent Night Thoughts," can it be fine-tuned to infer that the poem's last line is "Bowing my head, I miss my hometown"? A follow-up report is coming soon.


Source: blog.csdn.net/leyang0910/article/details/133392214