July paper review GPT version 2: from Meta Nougat and GPT-4 review to Mistral and LongLoRA LLaMA

Preface

As mentioned in the third part of the previous article, "Source code interpretation and fine-tuning of academic paper GPT: from chatpaper, gpt_academic to July paper review GPT", for paper abstracting/summarization, dialogue, translation, and grammar checking, the academic-paper GPT tools on the market are at least passable, even if not yet very good. But when revision/review is required, the effectiveness of the existing academic-paper GPT tools drops sharply

Why? Essentially because, whatever the function, these tools are all implemented on top of an API, and the API is not omnipotent. The API is fine for translation/summarization/dialogue, but when asked to provide review comments on a paper, it falls short. So, to achieve better review results, it is necessary to fine-tune on a specific aligned dataset and obtain a model with strong review capabilities.

So, in the first version, we did the following three things

  1. Crawled more than 30,000 papers and hundreds of thousands of review entries, and parsed the 30,000+ papers from PDF (once crawled, the review data is already plain text and needs no parsing)
    Of course, some of these papers were accepted and some rejected
  2. To improve data quality, we applied a series of data-processing steps to both papers and reviews
    The main processing is on the review data (specifically, for the papers we only removed the references; everything else is processing done on the reviews)
  3. Fine-tuned based on RWKV, but because of its serious forgetting problem, the final results did not meet expectations

So, after entering Q4, our company's paper-review GPT project team began work on the second version (our company is currently iterating three major LLM projects, each with its own project team; besides the paper-review GPT led by A Xun, there are also the AIGC model-generation system led by Huo Ge and Chaoyang's enterprise knowledge-base Q&A), focusing on optimization in the following three major areas

  • Optimize data parsing and processing with Meta's OCR model nougat, which can extract LaTeX; at the same time we are comparing it against another parser, ScienceBeam
  • Learn from the paper in which GPT-4 served as a reviewer, and let the ChatGPT API organize the crawled review corpus into the following 4 aspects
    1 Significance and novelty
    2 Potential reasons for acceptance
    3 Potential reasons for rejection
    4 Suggestions for improvement
  • Optimize the model itself, e.g. Mistral or LLaMA with LongLoRA

Part 1 Analysis of PDF data of the paper in the second edition

1.1 Two major PDF parsers: nougat VS ScienceBeam

1.1.1 Meta nougat

nougat is Meta's open-source parsing tool for academic PDF documents (see its homepage and its code repository). It takes an OCR-based approach. Compared with previous parsing solutions, its most outstanding feature is that it can accurately recognize formulas and tables and convert them into Markdown-compatible text. The disadvantages are that conversion is slow and the parsed content may come out of order

Comparing it with another parser, ScienceBeam, we can see that

  • The best thing about nougat is that it converts formula images into LaTeX source code. In addition, the recognized content is segmented into text sections via "#" headings
    Its flaw is efficiency: it is very slow. Parsing three PDFs totalling roughly 80 pages takes about 2 minutes and occupies 20 GB of GPU memory. If it were deployed as a service where users upload PDFs for parsing, deployment could be difficult
  • ScienceBeam is much faster: three papers of the same size finish in about 1 minute, similar to the SciPDF used in the first edition, and it only needs a CPU

Of course, we also need to consider the parser's formatting granularity, i.e. into which parts the text is split. Do we need to pull out specific parts of the text for later processing? If the granularity is poor, that extraction becomes harder

  1. Environment configuration
    # Create a new virtual environment
    conda create -n nougat-ocr python=3.10
    # Activate the virtual environment
    conda activate nougat-ocr
    # Install the required packages with pip (mirror sources may cause version conflicts; it is recommended to use the official PyPI source, via a proxy if necessary)
    pip install nougat-ocr -i https://pypi.org/simple
  2. Usage
    # On first use, the latest model weights are downloaded automatically
    # For a single PDF file
    nougat {path to pdf file} -o {output directory}
    # For a directory containing multiple PDFs
    nougat {path to pdf directory} -o {output directory}
  3. Test examples
    Title and opening section
    Formula recognition and conversion
    Footnote recognition

1.1.2 ScienceBeam

ScienceBeam is a variant of the classic PDF document parser GROBID. It is the text-extraction method used in the paper "Can large language models provide useful feedback on research papers? A large-scale empirical analysis". Like other earlier parsing methods, it cannot parse formulas down to the LaTeX level, and it can only run on Linux systems with x86 architecture.

// To be updated

1.2 Parsing the 26,000 papers

In the end, 26,000 papers have reviews (the first edition had 30,000 papers in total, of which 25,000 had reviews; the second edition has 32,000 papers, of which 26,000 have reviews)

  1. One member of our review project team used two 24 GB P40s to parse half of them, and another member used a 48 GB A40 to parse the other half
  2. Because nougat parsing is so resource-intensive and our GPUs were limited at the time, parsing these PDFs took us one to two weeks...

// To be updated

Part 2: Processing of paper and review data in the second version

2.1 Processing of paper data in the second version

In the first version, we processed the paper and review data as follows (note: for the papers, only the references were removed; the rest is data processing done on the reviews)

In short, in the first edition

  • For papers, the work was mostly PDF parsing; in terms of data processing, only the reference section was removed
  • For reviews, more data processing was done. After all, as mentioned in the preface of this article: "once crawled, the review data is already text data and needs no parsing"

In the second version, the paper processing follows the first version, i.e. the reference section is removed from each paper

2.2 Processing of review data

The "b_forum" field is the foreign key linking a Review to its Paper: "b_forum" is the unique identifier (id) of the corresponding Paper.

  • If the Review data corresponding to a paper is a single row, it is a single Review
  • But often a single Paper corresponds to multiple Reviews, in which case multiple rows share the same b_forum

For the raw data, we did the following 4 processing steps:

  1. Filter out Reviews that do not meet the requirements
    Mainly removing the authors' own replies and comments on their paper
  2. Stringify each Review
  3. Filter out Reviews with too little content
  4. Standardize the content of each Review into 4 key points and "merge many into one", as detailed below

The code for this part of the data processing is, for now, available in July's "Large Model Project Development Online Camp"; a simplified sketch of the idea is shown below.
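
Below is a minimal, illustrative pandas sketch of these 4 steps. It is not the project's actual code: only the b_forum field comes from the text above, while the other column names (role, content) and the length threshold are hypothetical placeholders.

```python
# A minimal, illustrative sketch of the 4 review-processing steps above.
# NOT the project's actual code; field names other than "b_forum"
# (e.g. "role", "content") and the threshold are hypothetical.
import json
import pandas as pd

MIN_REVIEW_CHARS = 200  # hypothetical threshold for "too little content"

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Filter out rows that are not real reviews (e.g. the authors' own replies/comments)
    df = df[df["role"] == "reviewer"]

    # 2. Stringify each review: the raw field may hold structured JSON; flatten it to plain text
    def to_text(raw):
        if isinstance(raw, (dict, list)):
            return json.dumps(raw, ensure_ascii=False)
        return str(raw)
    df = df.assign(content=df["content"].map(to_text))

    # 3. Drop reviews with too little content
    return df[df["content"].str.len() >= MIN_REVIEW_CHARS]

def group_by_paper(df: pd.DataFrame) -> dict:
    # 4. Gather all reviews of the same paper (same b_forum) so they can later
    #    be merged into one standardized 4-point summary
    return {forum: g["content"].tolist() for forum, g in df.groupby("b_forum")}
```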

// To be updated

Part 3: Further processing of review data: standardizing the Review format and merging many into one

3.1 Stanford: letting GPT-4 serve as a paper reviewer for the first time

Recently, researchers from Stanford University and other institutions sent thousands of articles from Nature-family journals and top conferences such as ICLR to GPT-4, let it generate review comments and revision suggestions, and then compared them with the opinions given by human reviewers

So, how do you get an LLM to review your manuscript? Specifically, as shown in the figure below

  1. Crawl the PDF corpus
  2. Next, parse the title, abstract, figures, table captions, and main text of each PDF paper
  3. Then tell GPT-4 that it must follow the review feedback form of top journals and conferences, which includes four parts
    Whether the results are significant and novel (significance and novelty)
    Potential reasons for acceptance
    Potential reasons for rejection
    Suggestions for improvement
    Your task now is to draft a high-quality review outline for a top-tier Machine Learning (ML) conference for a submission titled “{PaperTitle}”:
    
    ```
    {PaperContent}
    ```
    
    ======
    Your task:
    Compose a high-quality peer review of a paper submitted to a Nature family journal.
    
    Start by "Review outline:".
    And then:
    "1. Significance and novelty"
    "2. Potential reasons for acceptance"
    "3. Potential reasons for rejection", List multiple key reasons. For each key reason, use **>=2 sub bullet points** to further clarify and support your arguments in painstaking details. Be as specific and detailed as possible.
    "4. Suggestions for improvement", List multiple key suggestions. Be as specific and detailed as possible.
    
    Be thoughtful and constructive. Write Outlines only.
  4. In the end, GPT-4 sharply pointed out, for the paper in the figure above: although the paper mentions the modality-gap phenomenon, it neither proposes a method to reduce the gap nor demonstrates the benefits of doing so

3.2 To make the model's learning of reviews more structured: standardize 4 key points and merge many into one

3.2.1 Designing a better prompt template to help the large model distill 4 content points from the review corpus

The Stanford work introduced in the previous section, which lets GPT-4 serve as a reviewer, is quite inspiring for our company's paper-review GPT.

  1. Read positively, it means our company is heading in the right direction: more than 50% of GPT-4's opinions were judged valid.
  2. Read negatively, it shows that even something as powerful as GPT-4 is still limited when used via its API: nearly half of its opinions were not adopted, which demonstrates the necessity and value of our company's fine-tuning for review.
  3. The organization of the review corpus is also very important, so that the model can learn in an orderly way: divided into points 1, 2, 3 and 4 without confusion, the logic and meaning behind each review description become immediately clear
    For example, if the review corpus we crawled could be organized into the following four blocks, the model would learn quickly
    1) Are the results significant? Are they novel?
    2) Reasons for acceptance of the paper
    3) Reasons for rejection of the paper
    4) Suggestions for improvement

Regarding the third point above, the organization of the review corpus, we (A Xun in particular, and secondly me) came up with a creative idea: use a prompt template to let a large model help organize the crawled review corpus, distilling the common review comments into the four aspects above.

How should this prompt template be designed? Drawing on the Stanford work from the previous section, the prompt template can be further optimized on top of the Stanford template.

// For now, see the "Large Model Project Development Online Camp"; this part of the article will be updated soon

3.2.2 How to make the summarized review results more comprehensive: merge many into one

We know that a single paper has multiple reviews, and there are three possible modes for learning from review data.

  1. One is to choose one out of many
    But choosing one out of many has a problem: if the individual reviews are not very comprehensive, choosing only one damages the richness of the review information
  2. One is to merge many into one
    Summarize multiple reviews into one (A Xun and I thought of this one after the other), which is effectively a synthesis. Here, GPT-3.5 16K or an open-source model can be used to help aggregate the review data

  3. One is multi-round interaction
    The workload here is relatively large, so it is not the first choice

In this way, for the roughly 24,000 papers with reviews left after final cleaning, the many-into-one approach lets us directly call a GPT-3.5 that supports 16K context (after all, 16K is long enough to send all of a paper's reviews to GPT-3.5 16K at once), or an open-source model, and have it extract the 4 key points directly from all the review data, i.e. roughly 24,000+ calls, as sketched below
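
Here is a minimal sketch of such a many-into-one call, assuming the OpenAI Python SDK (v1) and the gpt-3.5-turbo-1106 model discussed in the next section; the prompt wording is a simplified stand-in, not our actual prompt template (which is only shown in the online camp).

```python
# A minimal sketch of the "merge many reviews into one" call.
# The prompt here is a simplified stand-in for the real template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def aggregate_reviews(paper_title: str, reviews: list[str]) -> str:
    joined = "\n\n---\n\n".join(reviews)  # all reviews of one paper, fits in a 16K context
    prompt = (
        f'You are given all reviewer comments for the paper "{paper_title}".\n'
        "Aggregate them into exactly 4 sections:\n"
        "1. Significance and novelty\n"
        "2. Potential reasons for acceptance\n"
        "3. Potential reasons for rejection\n"
        "4. Suggestions for improvement\n"
        "If the reviews contain nothing for a section, leave it empty rather than guessing.\n\n"
        f"Reviews:\n{joined}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",   # the 16K-context model mentioned in the text
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```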

3.2.3 Processing review data through the final prompt: ChatGPT vs. open-source models

To sum up, the many-into-one strategy for processing Review data mainly places higher demands on the prompt:

  1. The large model must aggregate the views of all Reviews into one summary
  2. To keep the summaries uniform, specific categories (such as novelty, reasons for acceptance, reasons for rejection, suggestions for improvement) must be provided so that opinions are clearly "classified"
  3. Honesty must be emphasized to mitigate hallucination, and "show weakness" options should be provided in the prompt (e.g. allowing the model to reply "don't know" or to leave a result empty)
  4. To make it easier for downstream steps to extract the information of interest from the large model's output, the output format must be constrained

In other words, we designed the prompt based on the above requirements (the final prompt will be shown in July's "Large Model Project Development Online Camp"; this part of the article will be updated in Q1 next year)

Once the final prompt was designed, we could let a large model process the review data through it. So which large model should we choose, ChatGPT or an open-source model? To answer this, we compared the following three large models

  1. zephyr-7b-alpha
  2. Mistral-7B-Instruct-v0.1
  3. OpenAI's newly released gpt-3.5-turbo-1106, i.e. the GPT-3.5 Turbo 16K mentioned in the previous section

After comparison, OpenAI's gpt-3.5-turbo-1106 performed relatively better, with stronger capabilities and better results. In addition, after practical evaluation, the cost was acceptable and not too high.

// To be updated: the specific comparison method and which model works better. For now, see the online camp; this article will be updated later

However, during actual use we found that OpenAI places various fairly strict restrictions on API access, i.e. multiple layers of limits per user: per-minute request limits, per-day request limits, per-minute token limits, and per-day token limits (https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-one). Requests often hang without returning and without raising an error, so a great deal of time was spent in a loop of being told "access exceeds limit", waiting, retrying, and being rate-limited again. When we first used OpenAI's official interface, we only got through about 2,600 entries in roughly seven days (11.24 to 11.30, 2023), and the frequency of rate-limited requests kept rising, which was a real headache...

  1. With no other option, we found a domestic reseller and ultimately called the reseller's interface, which in turn calls the OpenAI interface. Issues such as per-user access-frequency limits and proxies were thus left to the reseller to solve (we also wondered why resellers can work around this kind of access restriction; based on past experience, we judged that resellers hold many OpenAI accounts and many proxy routes, with unified scheduling and management. When a user calls, the reseller picks a currently low-traffic official account to hit the official interface, and switches proxies automatically from time to time. Note that a proxy that accesses OpenAI at high frequency can actually be blacklisted, so continuously maintaining a proxy pool for automatic switching also matters)
    Of course, the reseller's interface sometimes still returns a rate-limit message at night (or during other peak periods), when more people are using it; at such times even the least-busy official account is no longer low-traffic, so access still gets limited
  2. In the end, using the reseller's relay interface, we got through more than 9,000 entries in about five days, from December 4 to December 8; a sketch of the retry loop we wrapped around every call follows
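
For reference, a minimal sketch of the retry-with-backoff wrapper this kind of workflow needs, assuming the OpenAI Python SDK (v1); the relay base_url and key shown are hypothetical placeholders, not real endpoints.

```python
# A minimal retry-with-backoff sketch around each summarization call.
# The reseller relay simply swaps in a different base_url and API key
# (the values below are hypothetical).
import time
from openai import OpenAI, RateLimitError, APITimeoutError

client = OpenAI(
    base_url="https://relay.example.com/v1",  # hypothetical reseller endpoint
    api_key="sk-relay-...",                   # hypothetical relay key
)

def chat_with_retry(messages, model="gpt-3.5-turbo-1106", max_retries=6):
    delay = 2.0
    for _ in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=model, messages=messages, temperature=0, timeout=120
            )
            return resp.choices[0].message.content
        except (RateLimitError, APITimeoutError):
            # back off exponentially when rate-limited or when the request hangs
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("request kept failing after retries")
```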

// To be updated

3.3 Related work: AcademicGPT: incremental training of LLaMA2-70B, including a paper-review function

3.3.1 AcademicGPT: Empowering Academic Research

In late November, A Xun from our company's second project team (i.e. the project team responsible for the paper-review GPT) noticed that

  1. A team submitted a paper, "AcademicGPT: Empowering Academic Research", to arXiv on November 21, 2023. The paper proposes AcademicGPT, obtained by continued training of LLaMA2-70B on academic data
  2. On top of AcademicGPT, the team then developed applications in four areas: academic Q&A, assisted paper reading, paper review, and assisted generation of titles and abstracts. Functions such as paper Q&A and paper summarization are already very common (for example, chatpaper mentioned in this article, and gpt_academic from a Chinese Academy of Sciences team, are all implemented via the GPT-3.5 API), but earlier open-source tools for paper review built on GPT-3.5 did not work well. Since AcademicGPT has a paper-review function and uses a 70B model, it deserves attention, so I studied their paper carefully
    (Of course, I believe they will soon pay attention to our company's paper-review GPT work and then improve their training strategy; after all, it is not surprising for peers to learn from each other)

There are two significant differences between them and us. First, they did incremental pretraining on LLaMA2 (AcademicGPT is a continual pretraining of LLaMA2-70B). Second, our company's current paper-review GPT targets only English papers (after all, July's clients who want to publish papers mainly target English EI/SCI journals, followed by Chinese journals), whereas they also considered Chinese. Given LLaMA2-70B's limited Chinese ability and academic-domain knowledge, they collected Chinese data and academic English data to strengthen both aspects

  • Chinese data: taken from CommonCrawl, Baike, Books, etc. (in addition, 200K academic texts were crawled from the web)
    Since CC data usually contains a lot of harmful content such as advertising and pornography, they needed to clean it. In the end, they used an LLM with the prompt shown in the figure below to clean the data taken from the web, e.g. producing various annotations on the documents
    (Based on the original paper, we judge that they probably first had the model annotate some texts from human-written prompts, then had ChatGPT annotate the same texts, compared the differences between the two, built a loss function and fine-tuned the model itself; once it was good enough, the model annotated the remaining text)

  • English academic data: they crawled more than 1 million papers from 200 top universities and 2.26 million papers from arXiv (as of May 2023). Longer papers were parsed with Nougat (the same as our company, July); shorter papers used the research team's own parser. In addition, they used more than 48 million free academic articles from unpaywall2, as well as academic-related data from Falcon's open-source dataset

On the 120B tokens of data obtained above, they used 192 A100 GPUs with 40 GB memory for continued secondary pretraining (I do not envy any of their work, only their 192 A100s; I really hope some big benefactor can solve July's GPU shortage, ^_^). After 37 days of training, LLaMA2-70B further acquired the ability to understand Chinese and academic content. More details on the training follow

  1. To speed up training, FlashAttention-2 (Dao, 2023) was used, which not only accelerates the attention module but also saves a lot of memory; they also used Apex RMSNorm, which implements a fused CUDA kernel
  2. Since AcademicGPT is a continued-training model of LLaMA2-70B, it uses some of the same techniques as LLaMA2, including
    RMSNorm (Zhang and Sennrich, 2019) instead of LayerNorm,
    SwiGLU (Shazeer, 2020) instead of GeLU,
    RoPE (Su et al., 2021) for positional embeddings instead of ALiBi (Press et al., 2021),
    BPE (Sennrich et al., 2015) for the tokenizer,
    plus DeepSpeed (Rasley et al., 2020) and ZeRO (Rajbhandari et al., 2020); their training is based on the gpt-neox (Black et al., 2022) framework, into which they integrated many newly introduced techniques. Training on 120B tokens took approximately 37 days on 192 A100 GPUs with 40 GB memory

3.3.2 Paper review: learning from ReviewAdvisor to summarize 7 key points of a review (similar to how our company drew on Stanford's work to summarize 4 key points)

Like our company, they collected 29,119 papers and about 79,000 reviews from the same paper-review website, and then processed them as follows

  • Paper-side processing: 7,115 papers with no content or no reviews were removed, as were papers that failed to parse
  • Review-side processing:
    • Reviews with too many newlines were removed
    • Reviews that are too short (fewer than 100 tokens) or too long (more than 2,000 tokens) were removed
    • Reviews whose judgment is inconsistent with the final Decision and whose confidence is low were removed
    • Similar to how our company "drew on Stanford's use of GPT-4 as a reviewer and let GPT-4 give review opinions along 4 key points to judge paper quality",
      they drew on the 7 key points of "Can We Automate Scientific Reviewing?" (after removing the Summary, 7 points remain):
      \rightarrow  1 Motivation/Impact
      \rightarrow  2 Originality
      \rightarrow  3 Soundness/Correctness
      \rightarrow  4 Substance
      \rightarrow  5 Replicability
      \rightarrow  6 Meaningful Comparison
      \rightarrow  7 Clarity
      and used that paper's open-source code to further annotate each Review, i.e. to extract the corresponding key points
      Specifically, when summarizing the 7 (or 8) key points of a single review, the model labels, token by token, which key-point category each token/word may belong to; after the model has labeled the entire review, the labeled content is extracted. For example, in the figure below, the model labeled tokens one by one as summary (purple), clarity (yellow), substance (orange), and so on; the colored key points are then extracted as the summary of the review, where + indicates positive sentiment and - indicates negative sentiment (a small sketch of this token-level tagging follows this list)
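
To make the token-level labeling concrete, here is a minimal sketch of what such an aspect annotator could look like with Hugging Face transformers. Only the aspect names come from the papers discussed here (using ReviewAdvisor's 8 abbreviations, detailed further below); the base model, label scheme and inference wiring are illustrative assumptions, not their actual code.

```python
# A minimal sketch of a BERT-style token-level aspect annotator, in the spirit of
# ReviewAdvisor. Model choice and label scheme are assumptions for illustration;
# only the 8 aspect names come from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

ASPECTS = ["SUM", "MOT", "ORI", "SOU", "SUB", "REP", "CMP", "CLA"]
labels = ["O"] + [f"{sign}_{a}" for a in ASPECTS for sign in ("POS", "NEG")]
id2label = dict(enumerate(labels))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)  # in practice this head would be fine-tuned on the manually annotated reviews

def tag_review(text: str):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred_ids = model(**enc).logits.argmax(dim=-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    # keep only tokens tagged with some aspect (label != "O")
    return [(tok, id2label[i.item()]) for tok, i in zip(tokens, pred_ids)
            if id2label[i.item()] != "O"]
```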

Finally, the paper data plus the summarized review data obtained from the above pipeline were used to fine-tune the 70B model

To help you understand, let me add some explanation of the paper "Can We Automate Scientific Reviewing?"


In fact, this paper's perspective is to regard a "Review" as an evaluation of the Paper's summary and corresponding content that checks factual correctness. The paper therefore models the paper-review problem as a summarization task, trains the then (2021) relatively advanced BART model, and obtains the ReviewAdvisor model

Through the evaluation system they designed, the following observations were made:

  • The model tends to generate non-factual statements
  • The model has not learned high-level understanding, e.g. it cannot distinguish high-quality from low-quality papers
  • The model tends to imitate the language style of the training data (it tends toward low-level patterns), e.g. it easily generates high-frequency sentences from the training samples
  • It can summarize the core ideas of a paper fairly well

The final conclusion is: "Model-generated reviews cannot yet replace human review, but they can assist human review."


There are two notable aspects of this work:

  • Enhancing the Review data (summarizing 8 key points from the review data)
    For the relatively messy Review content, the research team wanted to keep only the useful "structured" content, so they start by defining "structured aspects" and then extract the corresponding structured content from each Review, thereby achieving data enhancement on the Review side. The pipeline has 5 steps:

    1 Define structural aspects
    The research team settled on 8 key points they believe a "good Review" should cover: Summary (SUM), Motivation/Impact (MOT), Originality (ORI), Soundness/Correctness (SOU), Substance (SUB), Replicability (REP), Meaningful Comparison (CMP), and Clarity (CLA)
    2 Manual annotation
    A subset of Reviews is annotated by hand with these aspects, e.g. "... The results are new [Positive Originality] and important to this field [Positive Motivation] ..."
    3 Train an annotator
    Since manually labeling all the data is unrealistic, the Review data labeled in step 2 is used to train an annotator: a BERT extraction model that automatically tags aspect items in raw Reviews. That is, given a Review text, the model classifies it token by token and predicts which parts of the Review belong to which aspects; the annotator is then used to process the remaining data
    4 Post-processing
    The annotator's predictions are not completely reliable, so rules or manual correction of its predictions are needed
    5 Manual inspection
    People with a machine-learning background are invited to check the annotation results

    Note that once a model has been trained on the papers plus the review key points summarized above, we can feed a new paper to this model and have it output a review
  • Generating a Review (generating a review with the 8 key points when running inference on a new paper)
    Generate a Review from a given Paper; the model chosen is BART, whose maximum length at the time was 1024. Considering that a Paper is long, the whole review-generation scheme is designed in two stages: first select salient fragments from the Paper (compressing the input context length), then generate a review summary from those salient fragments

    Selecting salient fragments
    Keywords such as "demonstrate" and "state-of-the-art", together with many rule-based judgments on sentences, are used to determine the salient fragments

    Training the aspect-aware summarization model
    On top of a basic Seq2Seq model that predicts the output sequence (Review) from the input sequence (Paper), the research team introduced "aspect awareness" to assist prediction, emphasizing the model's output of "aspect points": two multi-layer perceptron heads are introduced so that the model not only generates the Review content token by token but also predicts the corresponding "key point" token by token

    Therefore, the model must learn two loss functions at the same time
    \mathcal{L}=\mathcal{L}_{\text {seq2seq }}+\alpha \mathcal{L}_{\text {seqlab }}
    This also means that in one inference pass the model outputs 2 sequences: one is the predicted Review content (whose loss is \mathcal{L}_{\text {seq2seq }}), and the other is the predicted key points (whose loss is \mathcal{L}_{\text {seqlab }}); a small sketch of this two-head objective follows
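
Below is a minimal PyTorch sketch of this two-head objective L = L_seq2seq + α·L_seqlab; the BART backbone and linear heads are standard, but the exact wiring is an illustrative assumption rather than ReviewAdvisor's released code.

```python
# A minimal sketch of the aspect-aware objective L = L_seq2seq + alpha * L_seqlab.
# The wiring is illustrative, not ReviewAdvisor's actual implementation.
import torch.nn as nn
from transformers import BartModel

class AspectAwareSummarizer(nn.Module):
    def __init__(self, num_aspect_labels: int, alpha: float = 0.1):
        super().__init__()
        self.bart = BartModel.from_pretrained("facebook/bart-large")
        hidden = self.bart.config.d_model
        self.lm_head = nn.Linear(hidden, self.bart.config.vocab_size)  # review tokens
        self.aspect_head = nn.Linear(hidden, num_aspect_labels)        # aspect tag per token
        self.alpha = alpha

    def forward(self, input_ids, attention_mask, decoder_input_ids,
                review_labels, aspect_labels):
        # decoder hidden states, one vector per generated position
        out = self.bart(input_ids=input_ids, attention_mask=attention_mask,
                        decoder_input_ids=decoder_input_ids).last_hidden_state
        lm_logits = self.lm_head(out)          # (batch, tgt_len, vocab)
        aspect_logits = self.aspect_head(out)  # (batch, tgt_len, num_aspect_labels)
        ce = nn.CrossEntropyLoss(ignore_index=-100)
        loss_seq2seq = ce(lm_logits.reshape(-1, lm_logits.size(-1)), review_labels.reshape(-1))
        loss_seqlab = ce(aspect_logits.reshape(-1, aspect_logits.size(-1)), aspect_labels.reshape(-1))
        return loss_seq2seq + self.alpha * loss_seqlab
```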

3.3.3 Reasons why the 70B AcademicGPT is not effective at paper review

Judging from the review cases shown in the original paper, the effect is not good.

The figure below shows two review cases from the paper

  1. The left part of the figure is review case 1 from the paper. It points out the shortcomings of the corresponding paper: "The writing needs polishing; there are too many spelling and grammatical errors. The experimental setup is not convincing enough. First, no baseline is provided. Second, the authors only run experiments on a single dataset. Third, the authors do not report the variance of the results."
    For the paper's authors, such a review comment may not be of much reference value. After all, if you claim there are too many spelling and grammatical errors, you had better point out exactly which paragraphs of the paper contain them
  2. The right part of the figure is review case 2 from the paper, but its fifth Weaknesses point reads "5. The writing of the paper could be improved. For example, the authors should explain what x_{t,i} means in Eq. (1)", i.e. the paper should explain the x_{t,i} in Equation (1), yet Equation (1) of the original paper does not involve x_{t,i} at all

There are many reasons for the poor performance

  1. Unlike our project team, which is going all out to iterate the paper-review GPT, for AcademicGPT paper review is only one of their four major applications
  2. The review side of their dataset was summarized with the early extractive model BART, and what an early model extracts may not be accurate; after all, it is still no match for GPT-3.5
  3. When making an extractive summary of review content, parts of the original wording are copied directly
    But reviews are written by many different people, the style of the original wording is highly inconsistent, and the model may have difficulty converging. In other words, their data come from "extractive summarization", i.e., as described above: "given a review, the model labels, token by token, which key-point category each token/word may belong to; after labeling the whole review, the labeled content is extracted"

    In short, the difference between them and our company's paper-review GPT is that
    they extract the key points extractively, while we extract the key points generatively
    \rightarrow  The extractive approach copies the original wording, but that wording varies wildly in style, the extraction model's capability is limited, and a lot of post-processing such as manual verification is needed
    \rightarrow  The generative approach, especially generation by an LLM, can produce a relatively uniform wording style on demand

Of course, it is hard to draw too many conclusions before they actually open it up for users to try; let's wait until it is available


Part 4 Model training/fine-tuning: from Mistral and Mistral-YaRN to LongLoRA LLaMA 13B

When selecting models for the second edition of the paper-review GPT, our company considered three candidates: Mistral, Mistral-YaRN, and LLaMA-LongLoRA. The following introduces these three models one by one, along with the corresponding training details and final results.

4.1 Mistral 7B: surpassing 13B models with grouped-query attention + sliding window attention

In May of this year, three former employees of DeepMind and Meta co-founded Mistral AI in Paris (its CEO Arthur Mensch previously worked at DeepMind Paris; CTO Timothée Lacroix and chief scientist Guillaume Lample both took part in developing the first-generation LLaMA at Meta, much like when some OpenAI employees left to found Anthropic). In October this year, they released their first base model, Mistral 7B

According to its paper "Mistral 7B" (in addition, this is its GitHub address)

  1. Mistral 7B outperforms the best current 13B model (Llama 2) on all evaluation benchmarks, and surpasses the published 34B model (LLaMA 34B) in reasoning, mathematics, and code generation
    Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation.
  2. The model uses grouped-query attention (GQA), which significantly speeds up inference and also reduces memory requirements during decoding, allowing larger batch sizes and thus higher throughput
    GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput
    For more on GQA, see "One article covering all kinds of attention: from multi-head attention (MHA) to grouped-query attention (GQA) and multi-query attention (MQA)"

  3. It is also combined with sliding window attention (SWA) to handle sequences of arbitrary length effectively
    SWA is designed to handle longer sequences more effectively at a reduced computational cost

Additionally, the authors provide a model fine-tuned for instruction following, Mistral 7B-Instruct, which outperforms the LLaMA 2 13B-chat model on both human and automated benchmarks

4.1.1 Sliding window attention: extending context length

In vanilla attention, the number of operations is quadratic in the sequence length and the memory grows linearly with the number of tokens. At inference time, this results in higher latency and lower throughput due to reduced cache availability (The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability)

To alleviate this problem, we use sliding window attention

  1. Each token can attend to at most W tokens from the previous layer (in the figure above, W = 3). Note that tokens outside the sliding window still influence next-word prediction
    Each token can attend to at most W tokens from the previous layer (here, W = 3). Note that tokens outside the sliding window still influence next word prediction.

    For example, take the sequence: The cat sat on the
    With standard attention, when computing the last token "the", we must compute the inner product of its query with the key of every preceding token; as the sequence grows longer, this computation stays considerable
    With sliding window attention, when computing the last token "the", we only need the inner products of its query with the keys of the 3 most recent tokens (these 3 tokens include "the" itself); a small mask sketch follows this list
  2. At each attention layer, information can move forward by W tokens. Hence, after k attention layers, information can move forward by up to k × W tokens
    At each attention layer, information can move forward by W tokens. Hence, after k attention layers, information can move forward by up to k × W tokens.
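
As an illustration of the pattern (not Mistral's actual implementation), here is a tiny sketch that builds a sliding-window causal mask and checks the "The cat sat on the" example with W = 3.

```python
# A minimal sketch of a sliding-window causal attention mask: query position i may
# attend to key positions j with j <= i and i - j < W. Illustrative only.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (i - j < window)       # causal AND within the window

mask = sliding_window_mask(seq_len=5, window=3)  # "The cat sat on the"
print(mask.int())
# the last row (the final "the") only attends to positions 2, 3, 4: "sat", "on", "the"
```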

4.1.2 Rolling Buffer Cache

A fixed attention span means that we can limit the cache size using a rolling buffer cache

  1. The cache has a fixed size W, and the keys and values of time step i are stored at cache position i mod W. Therefore, when position i is larger than W, past values in the cache are overwritten and the cache stops growing
    The cache has a fixed size of W, and the keys and values for the timestep i are stored in position i mod W of the cache. As a result, when the position i is larger than W, past values in the cache are overwritten, and the size of the cache stops increasing

    Take "The cat sat on the mat" with W = 3 as an example:
    When i = 0 ("The"), 0 mod 3 = 0
    When i = 1 ("cat"), 1 mod 3 = 1
    When i = 2 ("sat"), 2 mod 3 = 2

    When i = 3 ("on"), 3 mod 3 = 0
    When i = 4 ("the"), 4 mod 3 = 1
    When i = 5 ("mat"), 5 mod 3 = 2
  2. On a sequence length of 32k tokens, this reduces cache memory usage by 8x without affecting model quality
    On a sequence length of 32k tokens, this reduces the cache memory usage by 8x, without impacting the model quality.

If the buffer is compared to a warehouse of fixed total capacity, each new item occupies a slot; when the warehouse is full, the earliest item put in is removed so that new items can keep entering, and the items closest to the current time stay in the warehouse. In this way, a sequence of bounded length is retained while saving resources, as in the small sketch below.
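
A tiny sketch of the rolling buffer idea (illustrative only): the key/value of step i is written to slot i mod W, so only the W most recent steps survive.

```python
# Rolling buffer KV cache sketch: step i writes to slot i % W.
W = 3
cache = [None] * W

tokens = ["The", "cat", "sat", "on", "the", "mat"]
for i, tok in enumerate(tokens):
    kv = f"kv({tok})"     # stand-in for the real key/value tensors of step i
    cache[i % W] = kv     # older entries in the same slot are overwritten
    print(i, cache)

# After the last step the cache holds only the 3 most recent tokens:
# ['kv(on)', 'kv(the)', 'kv(mat)']
```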

4.1.3 Pre-filling and chunking: reducing repeated computation

When generating a sequence, we must predict tokens one by one because each token is conditioned on the previous ones. However, the prompt is known in advance, so we can pre-fill the (k, v) cache with the prompt, i.e.

  1. If the prompt is very large, we can split it into smaller chunks and pre-fill the cache chunk by chunk. We can do this by choosing the window size as our chunk size. For each chunk we then need to compute the attention over the cache and over the chunk itself
  2. The figure below shows how the attention mask works over the cache and the chunk; a mask sketch follows this list

    When pre-filling the cache, long sequences are chunked to limit memory usage
    We split a sequence into three chunks: "The cat sat on", "the mat and saw", "the dog go to". The figure shows what happens for the third chunk ("the dog go to"): it attends to itself with a causal mask (rightmost block), attends to the cache with a sliding window (center block), and does not attend to past tokens that fall outside the sliding window (left block)
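
A small sketch of the mask used for one pre-fill chunk (illustrative, not Mistral's code): the keys are [cache ; current chunk], and a query at absolute position i may attend to absolute position j only if j <= i and i - j < W.

```python
# Chunked pre-fill mask sketch with chunk size = window size W.
import torch

W = 4
cache = ["the", "mat", "and", "saw"]   # previous chunk kept in cache, absolute positions 4..7
chunk = ["the", "dog", "go", "to"]     # current (third) chunk, absolute positions 8..11

q_pos = torch.arange(8, 8 + len(chunk)).unsqueeze(1)                 # query absolute positions
k_pos = torch.arange(4, 4 + len(cache) + len(chunk)).unsqueeze(0)    # key absolute positions

mask = (k_pos <= q_pos) & (q_pos - k_pos < W)
print(mask.int())
# e.g. the first query "the" (position 8) sees cache positions 5, 6, 7 plus itself,
# while the last query "to" (position 11) sees only the current chunk.
```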

4.2 Mistral 7B combined with YaRN

Original paper of YaRN: https://arxiv.org/abs/2309.00071

// To be updated

4.3 LongLora LLaMA13B

// To be updated

Part 5 Evaluation of the model: how to evaluate the effect of the review GPT

5.1 How do Stanford researchers evaluate the effectiveness of GPT4 review comments?

As shown below

  • The reviews produced by the LLM and the human reviews are both sent to GPT-4 for summarization using a specific prompt.
    That is, the LLM is given the task of focusing on the potential reasons for rejection in each Review and outputting, in a specific JSON format, the key issues raised by the Review. The research team explains that the point of focusing on key issues is that "the criticisms in a Review directly help guide the authors to improve the paper"
  • The summarized LLM Review to be evaluated and the summarized human Review from the previous step are fed into GPT-4, which is instructed by a specific prompt to output new JSON content: GPT-4 points out the matching items between the two inputs and rates the degree of match (5-10 points)
    The authors found that matches rated 5 or 6 have poor reliability, so only scores of 7 or above are counted as "matches", and the overlap is computed as Hit = \frac{|A \cap B|}{ |A| }, where |A| is the number of criticisms raised by the LLM and |A \cap B| is the number of criticisms matched between the LLM and humans (a tiny computation sketch follows this list)
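
A tiny sketch of this overlap computation; the structure of the match records is a hypothetical simplification of the JSON that GPT-4 returns.

```python
# Hit = |A ∩ B| / |A|, where A is the set of criticisms raised by the LLM and a
# "match" is a GPT-4 match rating >= 7. Record structure is hypothetical.
def hit_rate(llm_criticisms: list[str], matches: list[dict], threshold: int = 7) -> float:
    # matches: e.g. [{"llm_item": "...", "human_item": "...", "score": 8}, ...]
    matched_llm_items = {m["llm_item"] for m in matches if m["score"] >= threshold}
    return len(matched_llm_items) / len(llm_criticisms)
```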

5.2 Drawing on the work of Stanford, how does our company evaluate the effectiveness of GPT review?

The Stanford researchers' evaluation of review quality described in the previous section looks close to ideal, but there is one small problem, namely

  1. Although an LLM can follow the instruction and return JSON-formatted content as required by the prompt, it cannot always be guaranteed to produce JSON that is easy to parse
  2. Fortunately, the gpt-4-1106-preview and gpt-3.5-turbo-1106 versions provide a JSON mode: pass response_format={"type": "json_object"} to the interface to enable it, also issue the "return in JSON format" instruction in the prompt, and the returned content will fully conform to JSON format, as in the sketch below
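
A minimal sketch of a JSON-mode call with the OpenAI Python SDK (v1); the prompt content is an illustrative stand-in, not our actual evaluation prompt.

```python
# JSON mode: response_format={"type": "json_object"} plus a "return in JSON" instruction.
import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},   # enable JSON mode
    messages=[{
        "role": "user",
        "content": ("Return in JSON format: list the key criticisms in the review below "
                    "under the key 'criticisms'.\n\nReview: ...")
    }],
    temperature=0,
)
result = json.loads(resp.choices[0].message.content)  # valid JSON by construction
print(result)
```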

// To be updated

So far, this article has disclosed many engineering details of our company's paper-review GPT project, details that are rarely found online (it is a commercial project, after all); of course, more is covered in the "Large Model Project Development Online Camp"


References and Recommended Reading

  1. Full-text translation of the paper in which GPT-4 serves as a reviewer: [Latest research from Stanford University] Using large language models to generate review comments
  2. GPT-4 becomes a Nature reviewer? Stanford and Tsinghua alumni tested nearly 5,000 papers; more than 50% of the results were consistent with human reviewers
  3. Several Chinese interpretations of Mistral 7B
    Model architecture optimization for open-source LLMs - Mistral 7B
    Mistral, the new favorite of the open-source community, the best 7B model
  4. Mistral 7B - the strongest 7B model, released by the Mistral AI team known as "Europe's OpenAI"

Record of creation, modification, and improvement

  1. On November 2, started writing this article.
  2. On November 3, focused on the second part and the ideas for GPT-4 review.
  3. On November 4, focused on Mistral 7B in Part 3.
  4. On November 5, continued improving the Mistral 7B part.
  5. On November 11, updated the section "2.2.2 How to make the summarized review results more comprehensive: merge many into one"
    Improved section 1.1.1 Meta nougat
    Incidentally, the technical research done to get this project into production makes me very happy, ^_^
  6. On November 15, added section 2.2: secondary processing of review data.
  7. On November 18, optimized part of the description in Section 2.2.
  8. On November 22, added to Part 2 the details of the data processing in the first version of the paper-review GPT, e.g. that paper-side processing only removed the references
    Added content related to "3.2.3 Processing review data through the final prompt: ChatGPT vs. open-source models"
  9. On November 23, added section 1.2: Parsing the 26,000 papers.
  10. On November 25, considering that after data parsing, data processing, and model training, model evaluation must also be done,
    added a new part: Part 5 Evaluation of the model: how to evaluate the effect of the review GPT.
  11. On December 8, because I have to give internal training for a company in Wuhan and will also teach the paper-review GPT in the "Large Model Project Development Online Camp", and as the project keeps advancing,
    added a new section: "3.3 Related work: AcademicGPT: incremental training of LLaMA2-70B, including a paper-review function"
    Also added notes on how to work around the API's various access limits when summarizing review data through OpenAI's API.
  12. On December 9, focused on optimizing the section "3.3.2 Paper review: learning from ReviewAdvisor to summarize 7 key points of a review (similar to how our company drew on Stanford's work to summarize 4 key points)".
