Keyword-BERT: The Killer Model for Semantic Matching in Question Answering Systems

 

Primer

Question answering is a very important way for people to communicate. The key is that we need to understand the other person's question and give the answer they want. Imagine a scene where, the night before the Qixi festival (Chinese Valentine's Day), your girlfriend or wife tells you affectionately:

Darling, Qixi is coming soon. Could you give me a new phone?

And you, absorbed in Kings Canyon (the Honor of Kings map) at that moment, may answer without thinking:

Sure, dear ~ I just saw a bunch yesterday: buy one get one free, nine-nine with free shipping. We can pick up a few on the cheap; a case breaks so easily anyway.

 

Before your voice has even faded, a lethal blow arrives.

(Wang Dachui, deceased at the age of 28)
Therefore, for such life-or-death questions, which can be found everywhere in daily life, as long as we cherish our lives and pay a little attention, we will not easily lose points. For a machine, however, this is a huge challenge, because a machine misunderstanding similar text is very common, so our AI is often mocked by users as "artificial stupidity" (a term that sounds utterly lacking in intelligence). As the people behind the AI, we have been committed to improving its ability, to get the machine out of this IQ dilemma as soon as possible. Specifically, for this Q&A scenario, we propose a new methodology and a killer model, so that the AI can understand you better and stay far away from fatal questions ~

Background

In daily life, we often ask our voice assistants (Xiaowei / Siri / Alexa / Xiao AI / Xiao Du, etc.) all kinds of questions. One class of questions is relatively rigorous, and the returned answer needs to be precise, like:

"What is the height of Yao Ming's wife", "In what year was Jay Chou's rice fragrance released? Which album is included?

We call this kind of question precise question answering. It can make use of knowledge graph technology to parse each component of the question (entities / entity relations, etc.), reason rigorously, and return the answer. (We have also accumulated experience in knowledge-graph question answering; we will share it another time, as this article does not cover it.) There is also another type of question, which either asks how to do something or has open-ended answers, such as:

"How to make omelet rice", "The basketball level of cxk under evaluation", "How much alcohol can burn?"

We call question answering over such questions open-domain question answering. For this kind of question it is difficult to strictly parse the sentence components for reasoning, or to give a single precise answer, so the usual workaround is to look for similar questions instead. The general process is as follows:


First of all, we need to maintain a massive, high-quality question-and-answer library. Then, for the user's question (Query), we first coarsely retrieve similar questions (Questions) from the library; for these candidate Questions, we then perform finer "semantic matching" to find the best-matching Question, and return its corresponding answer to the user, thus completing "open-domain question answering".

As we can see, the coarsely retrieved Questions carry a lot of noise; compared with our Query, many of them are similar in form but not in spirit. So the most central module is the Query-Question semantic matching, which finds the Question truly similar to the Query among a pile of look-alike candidates. Once it mismatches, we may fall into the "phone vs. phone case" danger: at best, users churn; at worst, the AI dies the same death as our friend above.
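To make this two-stage pipeline concrete, here is a minimal sketch (a toy, not the production system): stage 1 recalls candidates by cheap word overlap, stage 2 re-ranks them with a finer matcher. `QA_LIBRARY`, `coarse_retrieve`, and `semantic_match` are illustrative names; in a real system the stage-2 matcher would be a trained model such as the Keyword-BERT described below.

```python
from collections import Counter

# Toy QA library of (question, answer) pairs. In production this is massive.
QA_LIBRARY = [
    ("How do I reset my phone?", "Go to Settings > Reset."),
    ("Where can I buy a phone case?", "Most electronics stores sell cases."),
    ("How do I take a screenshot?", "Press power + volume down."),
]

def coarse_retrieve(query, top_k=2):
    """Stage 1: cheap recall by bag-of-words overlap (stand-in for BM25 / ANN search)."""
    q_words = Counter(query.lower().split())
    scored = []
    for question, answer in QA_LIBRARY:
        overlap = sum((q_words & Counter(question.lower().split())).values())
        scored.append((overlap, question, answer))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(q, a) for _, q, a in scored[:top_k]]

def semantic_match(query, question):
    """Stage 2: fine-grained Query-Question matching. A real system uses a
    trained model; Jaccard similarity is only a stub here."""
    a, b = set(query.lower().split()), set(question.lower().split())
    return len(a & b) / len(a | b)

def answer(query):
    candidates = coarse_retrieve(query)
    best_question, best_answer = max(
        candidates, key=lambda qa: semantic_match(query, qa[0]))
    return best_answer

print(answer("how to reset my phone"))  # -> "Go to Settings > Reset."
```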

Challenges & Current Solutions

Solving open-domain semantic matching is not easy, and its challenges mainly come from the following two aspects:

  • Wide coverage : open-domain questions span countless topics and phrasings, so no training set can cover them all

  • Sensitivity to key information : a single key word or phrase can flip a question pair from similar to dissimilar

For the second point, sensitivity to key information, we can look at some cases. The False Positive cases below look similar but are not, and are misclassified by the model as similar; the False Negative cases are actually similar but look different, and are misclassified as dissimilar.

The bold blue words represent the key information the model itself believes should be matched, and the red words represent the key information that actually needs to be matched but that the model misses. To solve open-domain semantic matching, industry and academia have been like the proverbial eight immortals crossing the sea, each showing their own powers. In general, the attempts can be seen as attacking the problem from two dimensions: data and model.

Data dimension

The positive samples of the training data (i.e., pairs of similar questions) are generally manually labeled, while the strategies for generating negative samples (i.e., pairs of dissimilar questions) differ subtly. The simplest and crudest is random negative sampling: given a question, pick another question at random from the large pool and combine the two into a negative sample. But such negative samples are obviously too easy for the model and cannot train it well. So we should instead hunt for genuinely hard-to-distinguish negative samples (we call them confusing samples) to strengthen the model, as sketched below.
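The difference between the two strategies can be shown in a few lines of Python (illustrative only; `lexical_sim` is a stand-in for whatever similarity the miner uses, and in practice true duplicates must be filtered out before treating a near neighbor as a negative):

```python
import random

def lexical_sim(a, b):
    """Stand-in similarity; any retrieval score (BM25, embeddings, ...) works here."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def random_negative(anchor, pool):
    """Simplest strategy: pair the anchor with any other question (too easy)."""
    return (anchor, random.choice([q for q in pool if q != anchor]), 0)

def confusing_negative(anchor, pool, sim=lexical_sim):
    """Harder strategy: pair the anchor with its most similar non-matching
    question, forcing the model to learn fine-grained key information."""
    hardest = max((q for q in pool if q != anchor), key=lambda q: sim(anchor, q))
    return (anchor, hardest, 0)  # label 0 = dissimilar

pool = [
    "how to scan a code to add a WeChat contact",
    "how to scan a code to join a WeChat group",
    "how to make omelet rice",
]
print(random_negative(pool[0], pool))     # may be trivially dissimilar
print(confusing_negative(pool[0], pool))  # the near-identical WeChat-group question
```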

As can be seen, there is currently no optimal strategy for obtaining such high-quality data; manual labeling must be added to a greater or lesser extent. In essence, semantic matching models rely heavily on labeled data, and this is the data pain point.

Model dimension

The better-known line of improvement starts from the model. Every year, academia and industry put out an endless stream of new semantic matching models, and they do solve some of the problems they claim to solve. Here we list some of them:

Although these models are many and varied, from the perspective of model structure there are no more than two categories: representation-based and interaction-based. A representation-based model first encodes the Query and the Question separately and lets them interact only at the top layer; representatives are DSSM and ARC-I. An interaction-based model lets the Query and the Question interact from the bottom layers; representatives are BERT, ARC-II, and MIX. The differences between individual models come down to differences in the internal modules (RNN, CNN, Transformer ...); within this big frame, there is nothing more.
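The structural split can be sketched in toy PyTorch (not any of the cited models; sizes and modules are arbitrary): a representation-based model encodes each text alone and lets the two vectors meet only at the classifier, while an interaction-based model concatenates the texts and lets their tokens attend to each other from the very first layer.

```python
import torch
import torch.nn as nn

EMB, HID, VOCAB = 64, 64, 10000

class RepresentationBased(nn.Module):
    """DSSM / ARC-I style: encode each text separately, interact only at the top."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.enc = nn.GRU(EMB, HID, batch_first=True)
        self.cls = nn.Linear(HID * 2, 2)

    def forward(self, query_ids, question_ids):
        _, hq = self.enc(self.emb(query_ids))      # encode query alone
        _, hd = self.enc(self.emb(question_ids))   # encode question alone
        return self.cls(torch.cat([hq[-1], hd[-1]], dim=-1))

class InteractionBased(nn.Module):
    """BERT / ARC-II style: let tokens of both texts interact from the bottom."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        layer = nn.TransformerEncoderLayer(EMB, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(EMB, 2)

    def forward(self, query_ids, question_ids):
        x = self.emb(torch.cat([query_ids, question_ids], dim=1))
        h = self.enc(x)            # cross-text attention at every layer
        return self.cls(h[:, 0])   # first token as the pair summary

q = torch.randint(0, VOCAB, (2, 8))
d = torch.randint(0, VOCAB, (2, 10))
print(RepresentationBased()(q, d).shape, InteractionBased()(q, d).shape)
```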

This article does not intend to debate the pros and cons of the two categories; that discussion has long been covered elsewhere. Our focus is on:

Can these models really solve the two challenges of open-domain Q&A: wide coverage and sensitivity to key information?

Judging from the evaluation results of these models, the answer is: no.

As for the deeper explanation, I think the models are still constrained by the data. As the saying goes, the data determines the upper limit, and the model merely approaches that limit. If we cannot provide enough training samples to teach the model to discern the key information, then no matter how fancy the model's own CNN / RNN / Attention machinery is, it may still fail on some hard cases. At prediction time, given the wide coverage of open-domain questions, question pairs that never appeared in the training samples show up easily (the Out-Of-Vocabulary, OOV, problem): the key similar / dissimilar word pairs in these questions have never been seen, and at that point the model can only guess blindly.

Summary of pain points

In summary, although the great minds of industry and academia continue to shine in this field, we still face two major pain points in the open-domain semantic matching scenario:

  • Data pain point : the model relies on high-quality labeled data

  • Model pain points :

    • The model cannot capture the key information in hard samples

    • The model is helpless with OOV similar / dissimilar word pairs

Tao: Methodology

In order to fundamentally solve these two pain points, we no longer confine ourselves to the technical level, tinkering with data sampling and model design; instead, we first think deeply about the root of the problem and propose a methodology at the level of the Tao, as follows:

We make two improvements to the traditional semantic matching framework. One is the introduction of a keyword system: we extract keywords / key phrases from the massive open domain, and then add an extra annotation for the keywords that appear in the training / prediction samples. The other is to modify the model accordingly, to strengthen its capture of this key information. The core of both changes is to explicitly introduce key information into the data and the model, so that we can fundamentally solve the data and model pain points we face, instead of merely scratching the surface. A sketch of the annotation step follows.
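As a concrete illustration (assuming a keyword dictionary has already been built; `KEYWORD_DICT` here is a toy stand-in for the million-entry dictionary described later), each question in a sample is tagged with the dictionary keywords it contains, and the tags travel with the sample into training and prediction:

```python
# Hypothetical keyword dictionary; in reality mined from the open domain (see below).
KEYWORD_DICT = {"WeChat", "WeChat group", "QQ group", "Douban group"}

def annotate(question, keyword_dict=KEYWORD_DICT):
    """Attach the keywords found in a question as an extra annotation.
    Longest match first, so 'WeChat group' wins over 'WeChat'."""
    found, rest = [], question
    for kw in sorted(keyword_dict, key=len, reverse=True):
        if kw in rest:
            found.append(kw)
            rest = rest.replace(kw, " ")  # prevent nested re-matching
    return {"text": question, "keywords": found}

sample = (
    annotate("how to scan a code to add a WeChat contact"),
    annotate("how to scan a code to join a WeChat group"),
    0,  # label: dissimilar; the keyword tags expose why
)
print(sample)
```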

Why does this solve the problem? Read on.

Interpretation

To make it easy to follow, we will explain our reasoning one point at a time, with concrete cases.

1. Improved model: strengthen the model to capture key information

This is easy to understand. In our model we add extra processing for keyword pairs, which is equivalent to adding extra features: it provides the model with more information and strengthens its ability to distinguish between question pairs. We will cover the specific improvement details in the next section.

2. Samples with keywords: reduce dependence on labeled data

Let us take an example, a negative sample we mentioned earlier: how to scan a code to add a WeChat contact and how to scan a code to join a WeChat group. The root cause of the dissimilarity of these two questions lies in the difference between WeChat and WeChat group. But what the model learns at first may be the difference between the two verbs add and join (because the embeddings of WeChat and WeChat group may be very close). Only if we provide additional samples, for example telling the model that how to add a Douban group and how to join a Douban group are similar, may the model learn that join vs. add is not the key, and then find the real key information. So if we mark the keywords from the start, it is equivalent to telling the model which words are the candidate key information; the model (after our improvement) will deliberately learn from this part, without having to figure it out by itself from ever more samples, thus fundamentally reducing the dependence on labeled data. Our results corroborate this point. Posting one result in advance: the figure below compares the amount of data the traditional BERT model and our modified Keyword-BERT need to achieve similar accuracy; we will elaborate in the next section.

3. Samples with keywords: prior information from the open domain, reducing training-set OOV

Again an example: how to scan a code to join a QQ group and how to scan a code to join a WeChat group. In the training samples, QQ group may never have appeared in a question pair together with WeChat group (the so-called OOV). But if, at prediction time, we have marked QQ group and WeChat group as keywords, it is equivalent to giving the model prior information: the model (through our modification) can use its own keyword module to specifically study the similarity / difference of these two words, obtain a better classification result, and reduce the negative impact of OOV.

Technique: Implementation

Once the level of the Tao is clearly explained, everything becomes clear, and the rest of the implementation follows naturally. There are only two improvements around the traditional framework:

  • How to construct a keyword system ?

  • How to improve the model ?

There is no standard answer for the concrete implementation. For the keyword system, any system that can extract a large number of high-quality keywords in the open domain is a good system; and the model improvement is not limited to the Fastpair and BERT that we improved, since similar ideas can be transferred to most of the known models in academia / industry. Still, we will show our specific implementation without reservation, for reference, as a brick thrown out to attract jade.

Keyword system

As mentioned above, a good keyword system needs to extract keywords that are both many and good, that is, large in quantity and high in quality.

To achieve this goal, we introduce the concept of domain, which fits the characteristics of our open-domain question answering: many fields, wide coverage. First, we obtain a flood of news / articles with domain labels and extract candidate keywords from them by various means. Then we design a diff-idf score to measure how domain-specific a keyword is; intuitively, the document frequency of a keyword within its own domain should be much higher than in other domains. After truncating the ranking by this score, post-processing is performed to remove noise, normalize entities, and so on. Finally, together with some public vocabularies, the results form a huge keyword dictionary. The specific process is as follows (tedious, but indispensable):
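The exact diff-idf formula is not given here, so the following sketch is one plausible formulation of the intuition: score a term by its in-domain document-frequency rate against its smoothed rate across all other domains, so that domain-specific terms score high and common words score near zero.

```python
import math
from collections import defaultdict

def diff_idf(term, domain, df, n_docs):
    """One plausible diff-idf: in-domain df rate vs. the (smoothed) rate in all
    other domains. High score = strongly domain-specific candidate keyword.
    df[d][t]  = number of docs in domain d containing term t
    n_docs[d] = total number of docs in domain d"""
    in_rate = df[domain][term] / n_docs[domain]
    other_docs = sum(n for d, n in n_docs.items() if d != domain)
    other_hits = sum(df[d][term] for d in df if d != domain)
    out_rate = (other_hits + 1) / (other_docs + 1)  # add-one smoothing
    return in_rate * math.log(in_rate / out_rate + 1e-9)

df = defaultdict(lambda: defaultdict(int))
df["games"]["Kings Canyon"] = 80; df["news"]["Kings Canyon"] = 2
df["games"]["today"] = 70;        df["news"]["today"] = 65
n_docs = {"games": 100, "news": 100}

for term in ["Kings Canyon", "today"]:
    print(term, round(diff_idf(term, "games", df, n_docs), 3))
# "Kings Canyon" scores far higher than the generic word "today".
```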

This process runs and is updated every day. Our keyword count currently reaches the millions, and the quality under manual evaluation is also good. Here are some examples:

Model evolution

Likewise, the model must be upgraded accordingly. Our model evolution route is shown below:

First, we made keyword improvements to Fastpair, which had been running online before; then we traded the shotgun for a cannon and switched to BERT to handle more complex business scenarios, improving it as well. We call the result Keyword-BERT. Judging from the metrics, it is a killer model that achieves a qualitative leap in matching quality; we elaborate below.

Improving Fastpair

The model structure of Fastpair is as follows:

Fastpair is in fact a modification of Fasttext to fit the text-pair classification scenario. Fasttext is designed for single-text classification, and for classifying text pairs it is clearly not enough to use only the n-gram features of the two texts, so it is natural to add pair-wise interaction features formed by combining words across the two texts. This line of thinking is similar to the "interaction-based" models mentioned at the beginning of this article: first fully fuse the information of the two texts, then classify. Our problem then becomes: how do we transform the Fastpair model so that it can additionally "focus" on key information? Our change is very intuitive: for each pair-wise feature that contains a keyword, add an extra learnable weight, as follows:

Here we borrow the idea of parameter factorization from FM (Factorization Machines) and decompose the isolated weight W_kq into the inner product of the embeddings of the two words, which not only reduces the number of parameters but also captures the commonality between pair-wise features containing similar keywords. We trained on about 600k question pairs from Baidu Zhidao (with a 1:1 ratio of positive to negative samples), and then manually labeled 2k hard positive and negative samples for evaluation. Judging from the evaluation metrics, the improvement is very significant.
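A minimal sketch of this factorization (our reading of the idea; the exact implementation may differ): instead of one isolated weight per keyword pair (k, q), each word gets an embedding and the pair weight is their inner product, so pairs containing similar keywords share statistical strength.

```python
import torch
import torch.nn as nn

class KeywordPairWeight(nn.Module):
    """FM-style factorization: weight(k, q) = <e_k, e_q>, replacing a huge
    |V| x |V| table of independent W_kq parameters with |V| x dim embeddings."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, k_ids, q_ids):
        # One learnable scalar per keyword pair; this scalar weights the
        # corresponding pair-wise feature's contribution to the Fastpair score.
        return (self.emb(k_ids) * self.emb(q_ids)).sum(dim=-1)

pair_weight = KeywordPairWeight(vocab_size=10000)
k = torch.tensor([17, 17])  # e.g. "WeChat", "WeChat"
q = torch.tensor([42, 99])  # e.g. "WeChat group", "Douban group"
print(pair_weight(k, q))    # two pair-wise feature weights
```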

However, due to the inherent shallowness of the Fasttext architecture, Fastpair's accuracy ceiling is not high, and it remains helpless against OOV pair-wise features. As the business scenario posed greater challenges, we needed to consider upgrading our arsenal.

Keyword-BERT

Compared with other known deep models, BERT is a nuclear-bomb-level improvement, so we naturally chose it (we also ran offline experiments, and the results were all as expected). Since the structure of BERT is well known, we will not go into its details; what we focus on is: how do we add an extra key-information capture module to BERT? Our idea is consistent with the Fastpair improvement, except that the pair-wise interaction becomes an attention mechanism. The details are as follows:

On the one hand, we introduce an additional keyword layer at the top. Through attention and masking, it attends specifically to the keyword information between the two texts, enhancing the mutual information between them. On the other hand, for the output representations of the two texts, we fuse them using the fusion idea from machine reading comprehension, and the fusion result is fed, together with the CLS vector, into the classification layer. With this transformation, Keyword-BERT outperforms the original BERT at every number of layers.
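Below is a minimal sketch of such a keyword layer (dimensions and the exact fusion function are illustrative; see the repository linked at the end for the real implementation): cross-attention from one text to the other is restricted, via a mask, to keyword positions, and the pooled representations are fused in the [a; b; a - b; a * b] style common in reading-comprehension models.

```python
import torch
import torch.nn as nn

class KeywordAttentionHead(nn.Module):
    """Sketch of a Keyword-BERT top block: cross-attention restricted by a
    keyword mask, plus a fusion layer over the two text representations."""
    def __init__(self, hidden=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.fuse = nn.Linear(hidden * 4, hidden)  # reading-comprehension fusion
        self.cls = nn.Linear(hidden * 2, 2)        # fusion result + CLS vector

    def forward(self, h_query, h_question, kw_mask_question, cls_vec):
        # Attend from query tokens to the *keyword* tokens of the question only:
        # non-keyword positions are masked out of the attention.
        a, _ = self.attn(h_query, h_question, h_question,
                         key_padding_mask=~kw_mask_question)
        a = a.mean(dim=1)           # pooled keyword-aware query representation
        b = h_question.mean(dim=1)  # pooled question representation
        fused = torch.relu(self.fuse(torch.cat([a, b, a - b, a * b], dim=-1)))
        return self.cls(torch.cat([fused, cls_vec], dim=-1))

head = KeywordAttentionHead()
hq, hd = torch.randn(2, 8, 768), torch.randn(2, 10, 768)
kw_mask = torch.zeros(2, 10, dtype=torch.bool)
kw_mask[:, 3:5] = True  # positions 3-4 of the question are keywords
print(head(hq, hd, kw_mask, cls_vec=torch.randn(2, 768)).shape)  # (2, 2)
```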

We found that the fewer the layers, the more obvious Keyword-BERT's advantage over the original BERT. This is easy to understand: the fewer the layers, the less sentence-level information BERT can learn, and the keywords supplement exactly this kind of information. What we finally launched was the 6-layer Keyword-BERT, because its performance is very close to the original 12-layer BERT while its inference is much faster (under our internal self-developed BERT acceleration framework).

Extensions

Model structure attempts

The Keyword-BERT structure given in this article is our best practice after much trial and error. We also tried:

  1. Directly replacing the 12th (top) layer of the original BERT with the keyword attention layer : the effect is poor, because keywords can only serve as supplementary information, not replace the original semantic information.

  2. Adding the keyword attention layer at the bottom of the model : the effect is poor, because the key information is gradually diluted as the bottom-layer information "propagates" to the upper layers.

Future work

Keywords provide information along only one dimension. We could also add richer information (such as part-of-speech tags or graph attributes of words) to further enhance the model's discrimination ability; the model framework can still use our existing structure.

The original paper and source code can be found at: https://github.com/DataTerminatorX/Keyword-BERT

