GLM just got a major upgrade: the Tsinghua team has given GLM network access and launched WebGLM, a web-enhanced version!

Original article from Xi Xiaoyao's Tech Talk | Authors: Xiaoxi, ZenMoore

Large models generate unreliable answers? One very direct idea is to augment them with retrieval, combining the "knowledge" of traditional search engines with large models.

In fact, well before InstructGPT, OpenAI had released WebGPT, a model for aggregating search results. Built on GPT-3, WebGPT imitates human "search behavior" to query a search engine and aggregate the results into answers, and it achieved very good results on question answering.

Drawing on WebGPT's idea of combining a model with search-engine capabilities, Professor Tang Jie's team at Tsinghua University plugged a network cable into ChatGLM, currently the leading domestic open-source large model, and launched WebGLM, an enhanced question-answering system based on GLM-10B. WebGLM completes question answering and retrieval more accurately and efficiently, and in experiments its 10B parameters even approach the performance of the 175B-parameter WebGPT:


WebGLM's code is now public at the address below; readers who want to try it can follow the link~

Paper title:

WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences

Paper link:

https://arxiv.org/pdf/2306.07906.pdf

Project homepage:

https://github.com/THUDM/WebGLM

An official usage introduction is shown in the figure below:

For example, when asked when the pandemic will end, WebGLM cites different web links in its answer. The reply is quite professional, and it lists real reference links, which greatly enhances its credibility.

In another example, WebGLM also handles the "softer" question of how to balance work and life very well.

Following WebGPT, a web-enhanced question-answering system generally involves three components: a Retriever, a Generator, and a Scorer. The Retriever is essentially a large-model-augmented retriever, and in WebGLM it works in two stages:

  • Coarse-grained search : a three-step pipeline of search, fetch, and extract (see the sketch after this list). Search feeds the user's question to the Google API to obtain URLs of candidate web pages; fetch crawls the HTML content of those URLs in parallel; extract splits each page's text into a list of paragraphs based on HTML2TEXT.
  • Fine-grained search : the coarse stage may still return a lot of content irrelevant to the question, so WebGLM uses its pre-trained Contriever retriever together with ChatGLM to "purify" the coarse-grained results (a reranking sketch appears further below).
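
A minimal sketch of what this coarse-grained pipeline could look like in Python, assuming the aiohttp and html2text packages; the search_api stub and all function names here are illustrative assumptions, not WebGLM's actual code:

```python
# Hypothetical sketch of the search -> fetch -> extract pipeline.
import asyncio
import aiohttp
import html2text

async def fetch_page(session, url):
    # Fetch stage: download one candidate page's HTML.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        return await resp.text()

async def fetch_all(urls):
    # Pages are crawled in parallel; this is where most wall-clock time goes.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_page(session, u) for u in urls),
                                    return_exceptions=True)

def extract_paragraphs(html):
    # Extract stage: convert HTML to plain text and split into paragraphs.
    text = html2text.html2text(html)
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def coarse_search(question, search_api):
    # Search stage: a search API (e.g. Google's) maps the question to URLs.
    urls = search_api(question)          # placeholder for the real API call
    pages = asyncio.run(fetch_all(urls))
    paragraphs = []
    for page in pages:
        if isinstance(page, str):        # skip pages whose fetch failed
            paragraphs.extend(extract_paragraphs(page))
    return paragraphs
```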

Across the whole pipeline, most of the time is spent fetching web pages, so WebGLM greatly reduces page-loading time with parallel asynchronous crawling (as in the fetch_all helper sketched above).
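
For the fine-grained stage, here is a hedged sketch of reranking the coarse paragraphs with the public facebook/contriever checkpoint from Hugging Face, using the mean-pooling recipe its model card describes; WebGLM's actual retriever is further fine-tuned, so this is an approximation rather than its real code:

```python
# Hypothetical reranking sketch using the public Contriever dual encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def top_k_paragraphs(question, paragraphs, k=5):
    # Keep only the k paragraphs most similar to the question.
    q, p = embed([question]), embed(paragraphs)
    scores = (q @ p.T).squeeze(0)
    best = scores.topk(min(k, len(paragraphs))).indices.tolist()
    return [paragraphs[i] for i in best]
```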

The Generator is responsible for producing high-quality answers to questions from the reference pages returned by the retriever, and it is the core of web-enhanced GLM. For WebGPT, OpenAI hired a team of full-time experts to construct a dataset of triples of questions, answers, and valid reference links; for WebGLM, the author team instead used the in-context learning ability of large models to construct WebGLM-QA, a question-answering dataset with 45,000 filtered and 83,000 unfiltered samples.

Dataset generation exploits the excellent in-context learning ability of large models; the authors call the approach a Bootstrapped Generator, shown step by step in the figure above. Generation is divided into three stages: prompt formulation, instruction inducting, and few-shot in-context learning. In prompt formulation, the authors compared several prompt designs to determine the optimal one; in instruction inducting, they let the LLM design its own instructions for generating answers to questions; and in few-shot in-context learning, demonstrations are selected for inference, completing the construction of the dataset.
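
As a rough illustration only, the loop below sketches how an induced instruction plus few-shot demonstrations could be packed into a prompt to bootstrap cited question-answer pairs; INSTRUCTION, call_llm, and the record format are assumptions, not the paper's actual prompts:

```python
# Hypothetical sketch of bootstrapping a cited-QA dataset via in-context learning.
INSTRUCTION = "Answer the question using the references; cite them as [1], [2], ..."

def format_example(question, references, answer=""):
    # Render one (references, question, answer) block; empty answer = to fill in.
    refs = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(references))
    return f"References:\n{refs}\nQuestion: {question}\nAnswer: {answer}"

def build_prompt(few_shot, question, references):
    # Induced instruction + selected demonstrations + the new query.
    shots = "\n\n".join(format_example(q, r, a) for q, r, a in few_shot)
    return f"{INSTRUCTION}\n\n{shots}\n\n{format_example(question, references)}"

def bootstrap_qa(questions_with_refs, few_shot, call_llm):
    # Each generated triple becomes one dataset record; a filtering pass
    # (e.g. dropping answers whose citations don't match) would follow.
    dataset = []
    for question, references in questions_with_refs:
        answer = call_llm(build_prompt(few_shot, question, references))
        dataset.append({"question": question, "answer": answer,
                        "references": references})
    return dataset
```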

Finally, to align with human goals and preferences, WebGLM adds the Scorer, which scores the answers WebGLM generates using reinforcement learning from human feedback; based on those scores, the model is fine-tuned and some samples are discarded. The overall model architecture is shown in the figure below:
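
As a generic illustration of the Scorer idea rather than WebGLM's implementation, a reward model can be trained on human preference pairs with the standard pairwise ranking loss and then used to filter generations; reward_model below is a hypothetical network mapping answer text to a scalar score:

```python
# Hypothetical sketch of preference-based scoring and filtering.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    # Standard pairwise loss: -log sigmoid(r_chosen - r_rejected) pushes
    # the human-preferred answer to receive the higher score.
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def filter_by_score(reward_model, samples, threshold):
    # Score generated answers and keep only those above a quality threshold,
    # mirroring the "discard some samples according to the score" step.
    with torch.no_grad():
        return [s for s in samples if reward_model(s).item() > threshold]
```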

In the experiments, evaluation covers both the answers and the reference links. Answers are judged on six metrics: fluency, correctness, truthfulness, objectivity, redundancy, and citation accuracy; reference links are judged on five metrics: relevance, density, truthfulness, toxicity, and social bias.

With 15 human experts scoring the systems on 272 questions, the results show:

WebGLM is only slightly behind WebGPT-175B, performs far better than Perplexity.ai and WebGPT-13B, even achieves the best scores on fluency, truthfulness, and redundancy, and approaches WebGPT-175B on correctness.

In addition, to test the quality of WebGLM's answers, the authors shuffled the answers generated by WebGLM, WebGPT-175B, WebGPT-13B, and Perplexity.ai, mixed them with human-written answers, and had real human evaluators judge answer quality in head-to-head A/B comparisons, effectively constructing a "Turing test" for answer generation. The results show that WebGLM wins against human answers 43% of the time, nearly tying WebGPT-175B's 45% win rate.

Source: blog.csdn.net/xixiaoyaoww/article/details/131314786