Application of Intelligent Question Answering Technology in Baidu Search

Author | Xiaodong

Introduction

This article introduces the application of intelligent question answering technology in Baidu Search, covering the development history of machine question answering, generative question answering, and intelligent question answering applications in Baidu Search. You are welcome to join the Baidu Search team and explore the future of intelligent question answering technology together; a resume submission address can be found at the end of the article.

The full text is 6474 words, and the estimated reading time is 17 minutes.

01 What is machine question answering?

Machine question answering allows a computer system to automatically answer questions that humans describe in natural language. For example, suppose we ask: "What is the name of the program hosted by Wang Xiaoya?" We can type any question described in natural language into the Baidu search box and get the relevant answer directly in the first search result, as shown in the figure below:

[Figure: a question answered directly in the first result of Baidu Search]

Unlike traditional search engines, which return web links in response to keywords, machine question answering obtains answers directly from questions described in natural language, which greatly improves the efficiency of information acquisition. Machine question answering is everywhere in daily life: according to statistics, about 40% of search needs and about 30% of conversation needs are related to question answering.

So, what is the current status of machine question answering in Baidu Search? At present, the first result can directly satisfy most question answering needs, and there is no restriction on the domain of the user's question: it is an open-domain question answering system that can be asked about any information.

1.1 The development history of machine question answering

The development of machine question answering, shown below, has closely followed the development of machine learning.

[Figure: timeline of machine question answering development]

From the perspective of the development of model methods:

Before 2013, work mainly focused on feature engineering: given a question and some candidate answers, a variety of literal-matching features were designed to compute the degree of word overlap between question and answer, using algorithms such as BM25.

From 2014 to 2015, with the development of deep learning, neural networks such as CNNs and RNNs were used to compute the semantic distance between questions and answers.

From 2016 to 2017, attention-based network structures of various designs were used to further characterize the deep semantic matching relationship between questions and answers.

From 2018 to 2021, research focused mainly on pre-trained models: larger and better pre-trained models were used to complete complex question-answer matching tasks.

Starting in 2022, more attention has been paid to the application of generative models.

From the perspective of the development of datasets:

In 2013, MCTest appeared, mainly in the form of multiple-choice and cloze questions.

In 2016, SQuAD was born: the first large-scale reading comprehension dataset in which answers are extracted from a provided article based on user questions.

In 2017, Baidu released the DuReader dataset, the first Chinese reading comprehension dataset.

In 2018, HotpotQA and other datasets were released, enabling more in-depth research into complex question answering scenarios such as multi-hop reasoning and commonsense reasoning.

1.2 Machine question answering modeling

The current mainstream paradigm: Retriever + Reader

Retriever: retrieve candidates for a query. Given a query, obtain relevant candidates, which may be web pages, videos, tables, knowledge graphs, etc.

Reader: extract answer information from the given candidates. Based on the candidates, the answer is further extracted according to the query.

Baidu Search is already a very powerful Retriever that can provide relevant candidates for a query, so our research focuses more on the Reader, that is, how to better extract answers from search results.

[Figure: the Retriever + Reader paradigm]
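To make the paradigm concrete, below is a minimal sketch of the Retriever + Reader interface in Python. Both components are illustrative stand-ins (a word-overlap retriever and a trivial reader that just returns the top candidate), not Baidu's production systems.

```python
from typing import List

def retrieve(query: str, corpus: List[str], top_k: int = 3) -> List[str]:
    # Stand-in Retriever: rank passages by word overlap with the query.
    q_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q_words & set(p.lower().split())), reverse=True)
    return ranked[:top_k]

def read(query: str, candidates: List[str]) -> str:
    # Stand-in Reader: return the top candidate as the "answer".
    # A real Reader extracts a span or generates a summary from the candidates.
    return candidates[0] if candidates else ""

corpus = [
    "BM25 is a classic ranking function based on term matching.",
    "Machine reading comprehension extracts an answer span from a document.",
    "Knowledge distillation transfers knowledge from a large model to a small one.",
]
query = "How does machine reading comprehension obtain an answer?"
print(read(query, retrieve(query, corpus)))
```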

The early Reader was based on traditional feature engineering and was a very complex, systematic pipeline: first the query is analyzed to obtain the expected answer type, entity information, question type, and so on; several candidates are retrieved from the candidate library based on this information; complex matching features are designed to compute relevance scores between the query and the candidates; and finally a ranking function sorts the candidates and the highest-ranked one is returned as the answer. The process is as follows.

[Figure: the traditional feature-engineering QA pipeline]

This process is a serial pipeline in which errors accumulate at every step; the whole system cannot be trained end to end, and maintenance costs are high. Later, people looked for a more end-to-end way to solve these problems, and Machine Reading Comprehension (MRC) was proposed.

The MRC task is defined as follows: input a Question and a Document, replace the complex pipeline with a single model, and output the Answer. Early MRC work designed relatively complex network structures to model the relationship between questions and answers. A classic method is BiDAF. Its input layer maps the document and the query to embedding representations, and networks such as LSTMs learn contextual representations of the question and the document separately. A bidirectional attention interaction layer then models the relationship between the query and the document, and on this basis a further LSTM network produces richer context representations. The final output layer predicts, for each position, the probability of being the start or end of the answer, and the span with the highest probability is extracted as the answer.

[Figure: the BiDAF model structure]
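As an illustration of the output layer just described, the sketch below selects the best answer span from per-position start and end probabilities, with the constraint start <= end and a maximum span length. The probabilities are toy values, not the output of a real BiDAF model.

```python
import numpy as np

def best_span(start_probs, end_probs, max_len=30):
    # Pick the (start, end) pair with the highest start_prob * end_prob,
    # subject to start <= end and a maximum span length.
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

start_probs = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
end_probs = np.array([0.05, 0.1, 0.7, 0.1, 0.05])
print(best_span(start_probs, end_probs))  # best span covers tokens 1..2 with score ~0.42
```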

Early model structure design flourished with many variants, all aiming to better model the relationship between questions and answers.

Later, as pre-trained models developed, it became clear that complex model structure design is not necessary: the Transformer is the best model structure so far. This freed research effort to focus on pre-training itself, including pre-training task design, loss functions, pre-training data, and so on.

Against this background, a variety of pre-trained models have been produced, such as the earliest BERT and Baidu's ERNIE. These pre-trained models make MRC simpler: the query and the document are fed in as a single sequence, separated by special symbols, and after semantic representation modeling by the pre-trained model, the start and end positions of the answer are still predicted and the answer span is extracted.

[Figure: MRC with a pre-trained model]
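As a concrete illustration, the sketch below runs extractive MRC with a pre-trained model through the Hugging Face Transformers pipeline API: the query and document are concatenated internally, and the start and end positions of the answer are predicted. The checkpoint name is just one publicly available extractive QA model used for illustration, not the ERNIE-based model deployed in Baidu Search.

```python
from transformers import pipeline

# Load an extractive question answering pipeline (example public checkpoint).
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="What is the host best known for?",
    context="The host is best known for presenting a popular quiz show on national television.",
)
print(result["answer"], result["score"])  # the extracted span and its confidence
```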

02 Generative question answering

Generative technology has developed rapidly in recent years, and a great deal of work has been published.

A relatively representative early generative reader was S-NET in 2017, designed specifically for the MS-MARCO dataset. The characteristic of this dataset is that answers come from multiple articles and are not necessarily identical to the wording in the original text.

For such a task, the natural idea is to use a generative approach. S-NET designs a two-stage process. The first stage is an answer extraction model, very similar to the models introduced above; it also introduces an additional passage ranking task to rank candidate articles by relevance. The second stage is a generation model that takes the extracted results as input and generates a synthesized answer, as shown in the figure below.

[Figure: the two-stage S-NET architecture]

It can be seen that this early work is already very similar to the generative question answering process we use now: a retrieval module (the Retriever mentioned above) comes first, followed by candidate extraction, ranking, and generation. However, this work still relies on additional input information to produce the synthesized answer. You may wonder: is it possible to have a generative model that directly generates answers, without relying on extra input information and knowledge?

The T5 model in 2019 was the first to address this problem. It adopted a "pre-training + transfer learning" approach that unified different NLP tasks under the generative paradigm, handling question answering, machine translation, sentiment analysis, dialogue, and a series of other tasks in a uniform way, and it answered questions directly from the knowledge stored in a model with tens of billions of parameters (relatively large at the time). It also compared different generative model structures, including Encoder-Decoder, Decoder-only, and hybrids.

However, although models such as T5 can handle some simple questions, they are not yet at a commercially usable state: there is still room for improvement in their parameter scale and training methods, and they cannot directly achieve very good results on general questions. It was not until ChatGPT appeared, with a larger parameter scale (hundreds of billions of parameters) and stronger alignment with human responses for understanding user instructions and completing more complex question answering, that a conversational question answering product reached commercial level.

03 Intelligent question answering applications in Baidu Search

The question answering scenarios in Baidu Search are rich and diverse, and there are many ways to extract answers. For example, we can build knowledge graphs from encyclopedias or web pages through information extraction and extract answers from the knowledge graphs; a more general way is to extract answers directly from web page text through reading comprehension; we can also extract information from semi-structured data such as tables and organize it into a more structured display. Beyond text, there is also understanding and extraction of video content.

[Figure: diverse answer sources in Baidu Search: knowledge graphs, web text, tables, and video]

Facing such rich and diverse question answering scenarios, what challenges do we face?

Challenge 1: How can machine question answering handle the difficulties of complex semantic understanding, reasoning, and context modeling?

Challenge 2: Given the high traffic of search and machine question answering's demand for complex models, how can we achieve rapid response?

Challenge 3: In an open-domain search scenario, web page data is very complex and answer quality varies (errors, one-sidedness). How can we provide correct and high-quality answers?

3.1 Solve the difficulties of complex semantic understanding, reasoning, and context modeling

For example, in the first case, shown in the figure below, the answer mentions "she", which requires coreference resolution and understanding of the context, and the context may be very long. Only with a deep understanding can we know that what is being asked about is a quiz show, not another show. Solving this problem relies on fairly complex models.

[Figure: an example requiring coreference resolution and long-context understanding]

The solution we adopt is "large model + pre-training".

In pre-training, we use very rich data across several stages:

  • First, terabyte-scale general text is used for Pretrain to learn a basic language model;

  • Then, hundreds of gigabytes of business logs are used for Post-pretrain to achieve domain and objective transfer;

  • Next, we conduct detailed data mining and perform Finetune on gigabyte-scale manually annotated data to fit the business;

  • Finally, a closed loop between data and models is achieved through distant-supervision data augmentation, annotation quality identification, automatic mining and targeted annotation of weak data, and guidance from user behavior.

In terms of large models:

  • Models with tens of billions of parameters are used to improve knowledge memorization and language understanding;

  • Long-sequence modeling is used to fully understand the context.

For example, we use a model we call DocMRC, which simulates a person reading the entire article to answer reading comprehension questions. The logic is shown in the figure below.

[Figure: the DocMRC model structure]

The input layer supports long-sequence modeling and segments the entire document into sentences. A special [CLS] token is inserted before each sentence to aggregate that sentence's representation. A shallow word-level model first learns local representations of the input; a hierarchical structure on top of these representations then learns deep contextual relationships; finally, the answer is produced by labeling over the [CLS] token representations.

The output layer has two outputs. One labels sentences (using the sentence-level outputs) as a sequence labeling task to produce multi-sentence answers such as partial summaries of the question; the other highlights the key content within the answer, which may be several entities, by performing sequence labeling prediction over the token representations.
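The following is a minimal sketch of this hierarchical idea (a shallow word-level encoder, per-sentence [CLS] aggregation, a deeper sentence-level encoder, and sentence and token labeling heads). It is an illustrative simplification with assumed dimensions and layer counts, not the actual DocMRC implementation.

```python
import torch
import torch.nn as nn

class HierarchicalReader(nn.Module):
    """A shallow word-level encoder learns local representations; per-sentence
    [CLS] vectors are fed to a deeper sentence-level encoder to model
    cross-sentence context; two heads label sentences and tokens."""

    def __init__(self, vocab_size=30000, hidden=256, word_layers=2, sent_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        word_layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(word_layer, num_layers=word_layers)
        sent_layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers=sent_layers)
        self.sent_head = nn.Linear(hidden, 2)   # does this sentence belong to the answer?
        self.token_head = nn.Linear(hidden, 2)  # is this token key content (e.g. an entity)?

    def forward(self, sent_token_ids):
        # sent_token_ids: (num_sents, sent_len); position 0 of each sentence is its [CLS] slot
        local = self.word_encoder(self.embed(sent_token_ids))  # shallow, local representations
        cls_vecs = local[:, 0, :].unsqueeze(0)                 # (1, num_sents, hidden)
        context = self.sent_encoder(cls_vecs).squeeze(0)       # deep cross-sentence context
        return self.sent_head(context), self.token_head(local)

# Toy usage: a document split into 3 sentences of 16 tokens each
model = HierarchicalReader()
sent_logits, token_logits = model(torch.randint(0, 30000, (3, 16)))
print(sent_logits.shape, token_logits.shape)  # torch.Size([3, 2]) torch.Size([3, 16, 2])
```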

3.2 Improve the speed of the overall model and achieve rapid response

The daily user traffic of search is very large. As mentioned earlier, we need larger and more complex models, whose time and resource consumption is also very large. So, are there other ways to improve the speed of the overall model, achieve rapid response, and balance resources?

The hierarchical modeling just introduced is one solution that optimizes the model structure.

There is another general method: knowledge distillation, which distills the knowledge of a large model into a single small model, improving inference speed while keeping the effect similar. Here we adopt a "multi-teacher, multi-stage distillation" scheme.

For the question answering business scenario, we train multiple different teachers and raise the upper limit of the learning target by ensembling them. For multi-teacher distillation, a baseline solution is to directly average each teacher's score or loss weight and let the student fit that, but we believe this method may not achieve the best final effect. We want to make dynamic selections per sample (because different teachers have different strengths), so we designed a multi-stage distillation scheme in which teachers are dynamically selected based on the data, as shown in the figure below.

The first stage is teacher model training: multiple teachers are trained to raise the learning upper limit;

The second stage is unsupervised distillation: it is difficult to judge teacher quality on unlabeled data, so inter-teacher voting is used to dynamically select teachers based on the gradient direction and to eliminate possibly noisy teachers;

The third stage is supervised distillation, where teachers are dynamically weighted based on labeled samples.

[Figure: the multi-teacher, multi-stage distillation process]

Through this multi-stage, multi-teacher distillation method, we finally obtained a student model with very good results, even better than a single large model.
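Below is a minimal sketch of multi-teacher distillation with per-sample teacher weighting in the spirit of the scheme above: teachers are weighted by agreement with the teacher consensus when no labels are available, and by fit to the gold label when labels are available. The weighting rules are illustrative simplifications of the voting and dynamic-weighting stages, not the production training code.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels=None, temperature=2.0):
    # student_logits: (batch, num_classes); teacher_logits_list: list of (batch, num_classes)
    teachers = torch.stack(teacher_logits_list, dim=0)  # (num_teachers, batch, classes)
    if labels is not None:
        # Supervised stage: weight each teacher by how well it predicts the gold label.
        per_teacher_nll = F.cross_entropy(
            teachers.flatten(0, 1), labels.repeat(len(teacher_logits_list)), reduction="none"
        ).view(len(teacher_logits_list), -1)             # (num_teachers, batch)
        weights = F.softmax(-per_teacher_nll, dim=0)
    else:
        # Unsupervised stage: weight each teacher by its agreement with the teacher consensus.
        mean_probs = F.softmax(teachers / temperature, dim=-1).mean(dim=0)
        agreement = -F.kl_div(
            F.log_softmax(teachers / temperature, dim=-1),
            mean_probs.expand_as(teachers), reduction="none"
        ).sum(-1)                                        # (num_teachers, batch)
        weights = F.softmax(agreement, dim=0)
    # The student fits the per-sample weighted mixture of teacher distributions.
    target = (weights.unsqueeze(-1) * F.softmax(teachers / temperature, dim=-1)).sum(dim=0)
    return F.kl_div(F.log_softmax(student_logits / temperature, dim=-1), target,
                    reduction="batchmean") * temperature ** 2

# Toy usage: 4 samples, 3 classes, 2 teachers
student = torch.randn(4, 3)
teachers = [torch.randn(4, 3), torch.randn(4, 3)]
print(multi_teacher_kd_loss(student, teachers, labels=torch.tensor([0, 1, 2, 0])))
```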

3.3 How to provide correct and high-quality answers

The question answering data in search scenarios is very complex, and answer quality is uneven: many web pages contain wrong information or one-sided introductions. How to provide correct and high-quality answers is the third challenge we face.

Shown below is an example of a complex answer scenario in search. The answer on the left is lengthy, and users cannot quickly grasp the key points. In this case a summarization capability is needed so that users can quickly understand the key information in the answer and be satisfied more efficiently. Extractive answer extraction is no longer sufficient; we need generation technology to deeply compress and summarize the answers.

[Figure: an example of a lengthy answer that needs summarization]

In addition, the answer extracted from a single article may not be comprehensive enough. We need to summarize answers across multiple web pages, which also requires a generative model, as shown in the figure below: we summarize answers from multiple articles and mark the sources in the answer, so users can clearly see where the answer comes from.

[Figure: an answer summarized from multiple web pages with marked sources]

In summary, generating comprehensive, efficient, and correct answers requires a better generative model. There are currently many large language models, but what kind of large language model can complete the question answering task in search scenarios?

04 Retrieval-augmented generation

Currently, there are several problems with having large language models answer questions directly:

First, it is difficult for a large language model to remember all knowledge; some long-tail knowledge may be wrong or unknown;

Second, the knowledge of large language models easily becomes outdated and is difficult to update, so new knowledge cannot be perceived in a timely manner;

Third, the output of large language models is difficult to verify, and users currently have a low sense of trust; we cannot fully trust answers generated directly by the generative model.

In this situation, people hope to have ways to perform auxiliary answer verification.

4.1 The retrieval-augmented generation process

For search question answering scenarios, we have designed a retrieval-augmented generation solution, which has been deployed in Baidu Search. Retrieval-augmented generation supplements relevant information based on the search engine, which can effectively alleviate large-model hallucinations and improve the correctness, timeliness, and credibility of answers. The overall process is divided into several stages:

1. In the document retrieval stage, multiple reference sources are retrieved;

2. In the answer extraction stage, key information is extracted from the articles to reduce the burden on the generative model;

3. In the prompt composition stage, the prompt is composed from the obtained reference sources together with specific requirements, such as marking the sources in the answer content with serial numbers;

4. In the answer generation stage, the prompt is fed into the large model and the final search result is obtained.

[Figure: the retrieval-augmented generation pipeline and a generated answer with marked sources]

As shown in the figure above, the answer on the right is a summary of multiple articles, with the reference sources marked in it. This is the answer we expect to provide to users.
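As a concrete illustration of the prompt composition stage, the sketch below numbers the extracted reference passages, packs them into the prompt together with the question, and asks the model to cite sources by number. The exact prompt wording is an assumption for illustration, not the production prompt.

```python
def build_rag_prompt(question, passages):
    # Number each extracted reference so the generated answer can cite it.
    refs = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the reference passages below. "
        "Mark the source of each statement with its reference number, e.g. [1].\n\n"
        f"References:\n{refs}\n\nQuestion: {question}\nAnswer:"
    )

passages = [
    "Passage A extracted by the answer extraction stage.",
    "Passage B extracted from another retrieved document.",
]
print(build_rag_prompt("Example question?", passages))
# The resulting prompt is then fed to the large model in the answer generation stage.
```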

4.2 Training process of the generative large model

Our training process for the generative large model is divided into four stages, as shown in the figure below. The first two stages are close to current mainstream generative large model training; in the last two stages, we make special adaptations for the retrieval-augmented generation question answering scenario.

[Figure: the four-stage training process of the generative large model]

In the first stage, general pre-training, general web corpora and vertical corpora such as books, tables, and dialogues are used to obtain a general pre-trained base model;

In the second stage, instruction fine-tuning, general instructions are provided so that the model gains the ability to understand instructions;

In the third stage, annotated business instructions are used for specific fine-tuning so that the model can understand the question answering scenario of organizing multiple results in search;

In the fourth stage, detailed fine-tuning is performed based on user behavior feedback, improving the quality of generated answers through reinforcement learning and other methods.

4.3 Learning complex instructions through instruction decomposition

The instructions in the search business scenario are very complex: we impose very specific requirements and provide reference sources. So how do we get a generative model to understand such complex instructions? One solution is to annotate many such complex instructions and feed them to the generative model, but this approach is not necessarily optimal: if the model learns too many such instructions, it may not generalize well, leading to a decline in model performance. Is there another way?

Here, we draw on the idea of chain-of-thought (CoT) reasoning and propose a method of decomposing instructions to learn complex instructions in the retrieval-augmented generation scenario.

The complex instruction above can usually be accomplished in three simple steps:

The first step is to select search results that can be used to answer the question;

The second step is to organize and generate answers based on the selected search results;

The third step is to add reference sources in numbered form.

It can be seen that a very complex instruction can be decomposed into multiple simple instructions across multiple steps. We let the model first learn and understand the simple instructions; the model then needs far less complex-instruction data to reach a very good level of performance on complex instructions.
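The sketch below illustrates how the decomposed instructions can be issued as three sequential prompts. The `generate` function and the prompt wording are illustrative stand-ins for calls to the generative model, not the actual instructions used in production.

```python
def generate(prompt: str) -> str:
    # Stand-in for a call to the generative large model.
    return f"<model output for: {prompt[:40]}...>"

def answer_with_decomposed_instructions(question, search_results):
    numbered = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(search_results))
    # Step 1: select the search results that can be used to answer the question.
    selected = generate(f"Question: {question}\nResults:\n{numbered}\n"
                        "List the numbers of the results that help answer the question.")
    # Step 2: organize and generate an answer based on the selected results.
    draft = generate(f"Question: {question}\nSelected results: {selected}\n"
                     "Write a concise answer based only on these results.")
    # Step 3: add reference sources in numbered form.
    final = generate(f"Answer: {draft}\nSelected results: {selected}\n"
                     "Rewrite the answer, marking each statement with its source number.")
    return final

print(answer_with_decomposed_instructions("Example question?", ["result one", "result two"]))
```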

4.4 Inference acceleration and resource consumption reduction

For discriminative models, distillation and other techniques can be used. For generative models, however, reducing the model size has a larger impact on quality, so distillation is not particularly suitable and other acceleration methods are needed. There have been many related studies in the industry recently. For example, Inference with Reference targets the retrieval-augmented generation business scenario: by detecting a matching prefix, a fixed-length span of text is copied from the reference as a candidate sequence and verified against the model output; if consistent, multiple decoding steps can be completed in parallel, as shown in the figure below.

[Figure: parallel decoding with Inference with Reference]

In addition, there are more general generation acceleration methods. For example, a small model can quickly generate several steps; the small model's predictions are fed directly into the large model, which verifies whether they are consistent with its own decoding. Similar to the previous work, this also achieves acceleration. The key requirement is to make the small model's behavior as close as possible to the large model's, so that the probability of correct prediction is higher and the speedup ratio is larger.
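The sketch below illustrates the shared verification idea behind both methods: a cheap draft (a span copied from the reference, or a small model's output) proposes several tokens at once, and the large model accepts them up to the first position where they disagree with its own greedy choice. The `large_model_greedy_tokens` function is a stand-in for a single parallel forward pass of the large model.

```python
from typing import List

def large_model_greedy_tokens(prefix: List[str], draft: List[str]) -> List[str]:
    # Stand-in: what the large model would emit at each draft position given the prefix.
    return ["the", "capital", "of", "France", "is", "Paris"][len(prefix):len(prefix) + len(draft)]

def verify_draft(prefix: List[str], draft: List[str]) -> List[str]:
    targets = large_model_greedy_tokens(prefix, draft)
    accepted = []
    for proposed, target in zip(draft, targets):
        if proposed != target:
            accepted.append(target)  # replace the first mismatch with the model's own token
            break
        accepted.append(proposed)    # match: accept without an extra sequential decoding step
    return accepted

# Draft copied from a reference passage; four tokens are verified in one large-model pass.
print(verify_draft(prefix=["the", "capital"], draft=["of", "France", "was", "Paris"]))
# ['of', 'France', 'is'] -> two tokens accepted, the mismatch corrected, decoding continues from there
```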

Finally, I leave you with a question to think about: "What will the next generation of search engines look like?" I look forward to your answers and welcome you to discuss them with us.

Communication & resume delivery email: [email protected]

——END——

