A Summary of the Text Matching Direction (Datasets, Scenarios, Papers, Open-Source Tools)

Motivation

Not long ago I wrote a Zhihu answer to the question "Which NLP directions count as independent research directions?", after which many readers asked for references on classification and matching. Since there is already plenty of material on text classification, I won't write about it here (though before then you can read my earlier article "A summary of important tricks for text classification"). Matching scenarios are more varied and there are far fewer related articles, so this post aims to summarize the material on text matching that is worth working through.

Text matching is a very broad concept: as long as the goal is to study the relationship between two pieces of text, the problem can basically be treated as text matching. Because "matching" is defined very differently across scenarios, text matching is not a fully independent research direction. However, a considerable number of NLP tasks can be modeled as text matching, and once they are, you find that the model structures and training methods are highly similar yet subtly different. So while running a simple baseline is easy, doing well on a specific matching problem is not (especially before BERT).

Here are the specific things worth working through.

PS: reply "text matching" in the backend of my WeChat subscription account to receive a bundle of good papers (including the ones cited in this post) ~

Contents

  1. Baseline models
  2. Task scenarios and datasets
    a. Similarity computation & paraphrase identification
    b. Question-answer matching
    c. Dialogue matching
    d. Natural language inference / textual entailment recognition
    e. Information retrieval matching
    f. Machine reading comprehension
  3. Siamese structures (representation-based)
  4. Fancy attention structures (interaction-based)
  5. Learning-to-rank and evaluation metrics
  6. Pre-trained models
  7. Open-source tools

Baseline models

Whatever the specific matching problem, there are a few solid baselines that are worth implementing and running before anything else.

My personal favorite baseline is the SiameseCNN structure: it is quick to write from scratch, fast to run, decent in accuracy, stable in training, and relatively insensitive to hyperparameters.

 

[Figure: the SiameseCNN baseline structure; a shared encoder produces sentence vectors u and v, which feed feature construction and a classifier]

The general structure of the model is shown in the figure above. There is usually no need for anything fancier: use a single shared CNN to encode textA and textB, then apply max pooling or mean pooling to obtain the two text vectors vecA and vecB (u and v in the figure).

After that you could directly compute cosine similarity, L1 distance, Euclidean distance, and the like to get the similarity of the two texts. But text matching is not necessarily about judging whether two texts are similar: besides the similarity relation there are also question-answer relations, dialogue-reply relations, and textual entailment relations. So the more common practice is to construct matching features from the vectors u and v, and then let an additional model (such as an MLP) learn the mapping from those features to the matching relation.

As in the figure, these features can include u, v, |u - v|, and u * v; they can also include fancier ones. For example, I often use max(u, v)^2, which works surprisingly well in some scenarios. Of course, it is even better to carefully construct features according to the (bad) cases of your actual matching scenario.
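To make this concrete, here is a minimal PyTorch sketch of the baseline described above: a toy shared Conv1d encoder with max pooling, plus the feature construction and MLP head. All sizes and names (vocab_size, hid, and so on) are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class SiameseCNNMatcher(nn.Module):
    """Minimal sketch: shared CNN encoder + feature construction + MLP head."""
    def __init__(self, vocab_size=30000, emb_dim=128, hid=256, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, hid, kernel_size=3, padding=1)
        # features: [u, v, |u - v|, u * v, max(u, v)^2] -> 5 * hid dims
        self.mlp = nn.Sequential(
            nn.Linear(5 * hid, hid), nn.ReLU(), nn.Linear(hid, n_classes))

    def encode(self, ids):                       # ids: (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x))             # (batch, hid, seq_len)
        return h.max(dim=2).values               # max pooling -> (batch, hid)

    def forward(self, ids_a, ids_b):
        u, v = self.encode(ids_a), self.encode(ids_b)   # same encoder twice
        feats = torch.cat(
            [u, v, (u - v).abs(), u * v, torch.max(u, v) ** 2], dim=-1)
        return self.mlp(feats)                   # logits over match labels
```

Training is plain cross-entropy over the 0/1 match label; swapping the Conv1d for an LSTM gives the SiameseLSTM variant mentioned next.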

If you are partial to LSTMs, you can replace the CNN with an LSTM as the sentence encoder, i.e., use the SiameseLSTM structure. Either way, the encoder can of course be strengthened with various pre-trained models to improve the text representations.

However, now that BERT exists, I increasingly just take BERT as the baseline ╮(¯▽¯"")╭; after all, it barely requires writing any code, which is even more convenient (and often, once the baseline runs, the problem is already solved).
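For reference, a sketch of the BERT-as-baseline route using the Hugging Face transformers API (the checkpoint name and label count are illustrative assumptions; for a pair task you fine-tune the head on your matching data first):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # 0/1 match label

# BERT consumes the pair jointly as "[CLS] textA [SEP] textB [SEP]", so the
# interaction between the two texts is handled by self-attention internally.
batch = tok(["how old are you"], ["what is your age"],
            truncation=True, padding=True, return_tensors="pt")
logits = model(**batch).logits
print(logits.softmax(-1))   # meaningless until the head is fine-tuned
```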

Task scenarios and datasets

1. Similarity computation & paraphrase identification (textual similarity & paraphrase identification)

This is arguably the most typical and classic text matching scenario: deciding whether two texts express the same meaning, i.e., whether they form a paraphrase relation. Datasets either annotate a similarity level, where a higher level means more similar (the more reasonable design), or directly give a 0/1 matching label. This scenario is generally modeled as a classification problem.

Representative datasets:

  • STS (SemEval STS Task): a classic shared task held annually since 2012. The similarity of two texts is annotated from 0.0 to 5.0, where values close to 0.0 mean the texts are unrelated and values close to 5.0 mean they are highly similar. Pearson correlation is used as the evaluation metric.
  • Quora Question Pairs (QQP): released by Quora. Compared with STS this dataset is significantly larger, containing 400K question-question pairs labeled 0/1 to indicate whether the two questions mean the same thing. Since it is modeled as a classification task, the usual classification metrics such as acc and F1 apply. (When will Zhihu release a HuQP dataset? (¯∇¯))
  • MSRP / MRPC: a more standard paraphrase identification dataset. Whereas the texts in QQP come from users' questions, the sentences in MRPC are drawn from news corpora. MRPC is much smaller, with only 5800 samples (it was released in 2005 and manually annotated, so that is understandable ╮(¯▽¯"")╭). Like QQP, MRPC is generally evaluated with classification metrics such as acc or F1.
  • PPDB: this paraphrase dataset was built by distant supervision with a ranking method, so it is comparatively large. It covers the lexical level (word pairs), the phrase level, and the syntactic level (with parsing labels), and it contains not only English but also French, German, Spanish, and more: 15 languages (why no Chinese!). The corpus comes in sizes from S through M up to XXXL so users can download selectively, which is rather funny; the phrase level has over 70 million pairs and the sentence level over 200 million. Since the corpus is this large while the annotation quality remains acceptable, it can even be used to train word vectors [1].

2. Question-answer matching (answer selection)

Although question-answer matching can, like paraphrase identification, be modeled as classification, the actual scenario is usually to pick the correct answer out of several candidates, and the associated datasets are often constructed as one positive example plus several negatives. So it is often modeled as a ranking problem instead.

As for learning methods, you can not only use classification methods (called pointwise learning in ranking terminology) but also other learning-to-rank approaches, such as pairwise learning (one training sample = "a positive and a negative example of the same question") and listwise learning (one training sample = "the ranking of all candidates for the same question"). Correspondingly, evaluation tends to use ranking-oriented metrics such as MAP and MRR.

Note: this does not mean the pointwise (classification) approach necessarily performs worse; see the relevant papers for details.
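As a minimal sketch of the pairwise option, assuming some scoring model `score_fn(q, a)` that returns a scalar relevance score (the name and the margin value are illustrative):

```python
import torch

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """One training sample = (question, positive answer, negative answer):
    push the positive's score above the negative's by at least `margin`."""
    return torch.clamp(margin - score_pos + score_neg, min=0).mean()

# usage: loss = pairwise_hinge_loss(score_fn(q, a_pos), score_fn(q, a_neg))
# torch.nn.MarginRankingLoss implements the same idea.
```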

Representative datasets:

  • TrecQA: contains 56K question-answer pairs (but only 1K-odd questions, with a huge number of negatives per question). The original dataset is slightly dirty: it contains questions with no answers, questions with only positive examples, and questions with only negative examples (seriously, what are those). So take note when doing research on it: some papers use the clean version (with those three kinds of questions filtered out) and some use the raw version, which effectively splits one dataset into two tracks.
  • WikiQA: a small dataset that Microsoft built from Bing search queries and Wikipedia. It contains roughly 10K question-answer pairs (1K-odd questions), and the positive/negative ratio is finally a bit more normal. Paper [2]
  • QNLI: at last a large dataset. It was transformed from SQuAD: the sentence in the context that contains the answer span is treated as the positive match and the other sentences as negatives, yielding close to 600K question-answer pairs (covering close to 100K questions).

3. Dialogue matching (response selection)

Dialogue matching can be seen as an advanced version of question-answer matching, with upgrades in two main respects.

On the one hand, the dialogue history (the previous turns) is introduced on top of plain question-answer matching. Within the constraints of the history, some replies that would otherwise be perfectly fine candidates become unreasonable. For example, if the history mentions that you are 18 years old, then for the query "What are you doing at home today?" you cannot reply "I'm at home looking after my grandchildren."

ps: an example worth fifty cents (¬_¬)

On the other hand, for a given query, the space of valid dialogue replies is much larger than the space of valid answers. For a QA query the correct answers are often very limited, sometimes unique, whereas a dialogue query usually has a long list of reasonable replies, plus piles of universal replies such as "Oh", "Okay", and "hahaha". Very often the reply shares essentially nothing with the query at the lexical level, so dialogue matching models are harder to train, and with slightly lower data quality they struggle to converge. So if you are tired of question-answer matching, dialogue matching is quite interesting to work on.

This problem generally uses R_n@k (among n candidates, recall succeeds if a reasonable reply appears within the top k positions) as the evaluation metric; sometimes MAP, MRR, and the other metrics from question-answer matching are used as well.
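The metric itself is easy to compute from per-query candidate scores; a pure-Python sketch, assuming (as in many response-selection datasets) one labeled positive among the n candidates:

```python
def recall_n_at_k(scores, positive_idx, k):
    """scores: the n candidate scores for one query; recall succeeds
    if the positive candidate lands within the top-k positions."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(positive_idx in ranked[:k])

# e.g. R_10@1: average recall_n_at_k(scores, pos, k=1) over queries
# that each come with 10 candidates.
```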

Representative datasets:

  • UDC: the Ubuntu Dialogue Corpus is the most classic dialogue matching dataset, containing 1000K multi-turn dialogues (dialogue sessions) with an average of 8 turns per session. It is both large-scale and high-quality, so recent dialogue matching work basically all plays on it. Paper [3]
  • Douban Conversation Corpus: if one insists on finding a flaw in UDC, it is that it was built on the Ubuntu technology forum, a restricted domain, so the conversation topics are very specialized. Hence @Wu Yu and colleagues released this open-domain dialogue matching dataset; and since it is Chinese, doing case studies on it is rather enjoyable. Paper [4]

4. Natural language inference / textual entailment recognition (Natural Language Inference / Textual Entailment)

The goal of the NLI, or RTE, task is to determine whether text A and text B form an inference/entailment relation. That is, given a "premise" sentence A and a "hypothesis" sentence B: if, assuming the premise A is true, B must also be true, we say A entails B (B can be inferred from A); if, assuming A is true, B must be false, we say A and B contradict each other; and if B's truth cannot be determined from A, we say A and B are neutral (independent).

Obviously this task can be seen as a 3-way classification task, so the natural training methods and evaluation metrics of classification apply. There are also some earlier datasets that only contain binary entailment-or-not judgments; those are not listed here.

Representative datasets:

  • SNLI: the Stanford Natural Language Inference dataset is one of the landmark datasets of the deep learning era of NLP. Released in 2015 with 570K purely hand-written and hand-annotated samples (the industry's conscience, you could say), it became one of the then-rare NLP test grounds large enough for deep learning. Paper [5]
  • MNLI: the Multi-Genre Natural Language Inference dataset is similar to SNLI and can be seen as an upgraded version covering different genres of text (spoken and written); it contains 433K sentence pairs.
  • XNLI: short for Cross-lingual Natural Language Inference. As the name suggests, this is a multilingual dataset: XNLI translates a number of MNLI samples into 14 other languages (including Chinese).

5. Information retrieval matching

Besides the above four scenarios, there are also the query-title matching, query-document matching, and other matching scenarios of information retrieval. In IR, however, the general pipeline first recalls related items via retrieval and then reranks them. For such problems the important thing is the ranking, rather than a black-and-white or simple-selection decision; the ranking problem cannot rely on the text dimension alone as its feature, and by comparison, judging how deep and fine-grained the semantic match between two texts is matters less.

From the purely textual perspective, the matching methods of QA, dialogue, and NLI can of course in principle be applied to query-title problems. The query-doc problem is more of a retrieval problem: classic retrieval models such as TF-IDF and BM25, though they match texts at the term (word) level, already look decent in most cases when combined with query expansion (a compact BM25 sketch follows the related-work list below). If you must consider semantic-level matching, the traditional options are topic models such as LSA and LDA. Of course, deep learning can also be forced onto the problem, e.g., for query understanding or even direct query-doc matching (as long as you are willing to burn deployment resources); related work includes

DSSM: CIKM 2013 | Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
CDSSM: WWW 2014 | Learning Semantic Representations Using Convolutional Neural Networks for Web Search
HCAN: EMNLP 2019 | Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling
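And, as promised, a compact sketch of the BM25 term-level scoring mentioned above, with the common Lucene-style idf and typical default k1/b values; tokenization here is naive whitespace splitting, purely for illustration:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Score one tokenized doc against a tokenized query under BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N        # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)   # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

docs = [d.split() for d in ["cheap flights to paris", "paris travel guide"]]
print(bm25_score("paris flights".split(), docs[0], docs))
```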

6. Machine reading comprehension

There are also some less intuitive text matching tasks, such as machine reading comprehension (MRC). This is the problem of extracting an answer span from a passage, and from another angle it can be modeled as question-answer matching with context attached (albeit with rather a lot of candidates ╮(¯▽¯"")╭). Representative datasets such as the SQuAD series, MS MARCO, CoQA, and NewsQA respectively cover many typical NLP problems: MRC task modeling, multi-document settings, multi-turn interaction, and reasoning. On the matching side, representative work such as BiDAF and DrQA is best worked through first.

BiDAF: ICLR 2017 | Bidirectional Attention Flow for Machine Comprehension
DrQA: ACL 2017 | Reading Wikipedia to Answer Open-Domain Questions

PS:

In fact, the models for the scenarios above do not differ all that much, and with a few experimental tricks, most papers of the last two years that report results on more than one matching scenario claim to be a very general matching framework/model. Therefore the paper walkthrough below does not distinguish between scenarios; it is instead split into representation-based and interaction-based approaches.

Note: although representation-based matching methods (generally Siamese network structures) and interaction-based matching methods (generally completing the interaction with fancy attention) fought over each other for several years, text matching was in the end wrapped up by BERT and its successors. So please work through the following two sections in a mood of commemorating history; do not get tangled in paper details, as knowing the general story is enough.

Siamese structures (representation-based)

As mentioned at the beginning, this kind of structure first encodes the two texts separately to obtain their vector representations, and then derives the final matching relation through a similarity function or a related structure.

On top of the SiameseCNN and SiameseLSTM baselines mentioned above, work in this direction essentially proceeds along two lines:

1. Strengthen the encoder to obtain better text representations

2. Strengthen the modeling of the similarity function

For the first direction, it is simply a matter of using deeper and stronger encoders; representative work:

InferSent: EMNLP 2017 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

ps: although the real purpose of this paper is transfer learning

SSE: EMNLP 2017 | Shortcut-Stacked Sentence Encoders for Multi-Domain Inference

For the second direction, it is a matter of using fancier similarity functions, or fancy network structures to learn the similarity function; representative work:

SiamCNN: ASRU 2015 | Applying Deep Learning to Answer Selection: A Study and an Open Task
SiamLSTM: AAAI 2016 | Siamese Recurrent Architectures for Learning Sentence Similarity
Multi-view: EMNLP 2016 | Multi-view Response Selection for Human-Computer Conversation

Obviously this direction does not leave much room to play (though the work is easy enough to make the papers comfortable to write), so do not ask why it only goes up to 2017: from 2016 on, attention was everywhere, and naturally everyone rushed with the tide to build all kinds of fancy interaction structures.

Fancy attention structures (interaction-based)

As the name suggests, the idea here is to first let the two texts interact at various granularities (word level, phrase level, etc.) through attention-based structures, then aggregate the matching results at each granularity through some structure into a super feature vector, from which the final matching relation is derived.

Obviously, under this idea, besides making the interaction between the texts fancier, one can also make the model deeper (and thereby model higher-level matching relations).
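The core step shared by most of these models is soft alignment: cross-attention between the token representations of the two texts, in the spirit of the DecAtt/ESIM papers listed below. A minimal sketch (the shapes and the ESIM-style feature set are illustrative):

```python
import torch

def soft_align(a, b):
    """a: (batch, len_a, dim) and b: (batch, len_b, dim) token representations.
    Re-express each text as attention-weighted sums of the other's tokens."""
    e = torch.bmm(a, b.transpose(1, 2))                  # (batch, len_a, len_b)
    a_tilde = torch.bmm(torch.softmax(e, dim=2), b)      # b aligned to a's tokens
    b_tilde = torch.bmm(torch.softmax(e, dim=1).transpose(1, 2), a)
    # ESIM-style local matching features, to be aggregated later (e.g. pooling)
    m_a = torch.cat([a, a_tilde, a - a_tilde, a * a_tilde], dim=-1)
    m_b = torch.cat([b, b_tilde, b - b_tilde, b * b_tilde], dim=-1)
    return m_a, m_b
```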

In my personal experience, however, although this line of thinking allows lots of tricks, and the arguments in some papers do seem to hold water, many of the models are really structures that were madly (read: brute-force) tweaked (read: searched) into shape on one or two small datasets and then decorated with various scores. A structure that appears to work in one scenario or even just on a few datasets may in fact only fit certain characteristics of that data or the distribution of that particular scene, so a lot of this work capsizes when moved to new scenarios, and no amount of hyperparameter tuning will save it.

Therefore, although before BERT such papers looked impressive, once the dataset changes they may well be less practical than a SiameseCNN with a few casually tuned hyperparameters. When brushing through this type of paper, do not be bewitched by the fancy model structures; out of the large body of related work, pick a few that are representative, informative, or easy to read.

MatchCNN: AAAI 2016 | Text Matching as Image Recognition
DecAtt: EMNLP 2016 | A Decomposable Attention Model for Natural Language Inference
CompAgg: ICLR 2017 | A Compare-Aggregate Model for Matching Text Sequences
ESIM: ACL 2017 | Enhanced LSTM for Natural Language Inference
COLING 2018 | Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering

ps: this last paper can actually be read as a big pile of experiments on and analyses of the preceding models

DAM: ACL 2018 | Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network
HCAN: EMNLP 2019 | Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling

In addition, pay particular attention here to the problem of model symmetry. In scenarios such as similarity computation, qq matching, and title-title matching, matching is symmetric, i.e., match(a, b) = match(b, a); if you use an asymmetric model, the model has to learn this prior knowledge on its own as extra work, and unless the dataset is large or you have pre-training, performance can easily capsize. Of course, there are tricks for forcing an asymmetric model to work in such scenarios, e.g., running both match(a, b) and match(b, a) for each sample and then averaging them; but whether that beats a naturally symmetric model depends on your skill as an alchemist.
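The averaging trick just described fits in two lines (`match` stands for any asymmetric scoring model and is a hypothetical name):

```python
def symmetric_match(match, a, b):
    """Force symmetry out of an asymmetric matcher by averaging both orders."""
    return 0.5 * (match(a, b) + match(b, a))
```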

Learning-to-rank and evaluation metrics

Materials on the three learning-to-rank strategies (pointwise, pairwise, and listwise learning) are everywhere, so I will not go into them here. Readers unfamiliar with them may find this article helpful:

SLin: NLP interview essentials: pointwise, pairwise, listwise

Readers unfamiliar with evaluation metrics such as MAP, MRR, and NDCG can take a look at this article:

felix: A summary of basic learning-to-rank algorithms

Pre-trained models

Even though, after several years of alchemy, the hand-crafted model structures had achieved quite good results across many text matching tasks and scenarios, experiments showed they still could not compete with models pre-trained on massive corpora. First, a picture: experimental results on the QA dataset TrecQA.

 

[Figure: results table on TrecQA comparing HCAN with older models such as ESIM and DecAtt, and with BERT]

HCAN is the model newly proposed at EMNLP 2019; it handily beats ESIM, DecAtt, and the other fancy models of the older generation, yet as you can see it is still crushed by BERT, never mind comparing it with XLNet, ERNIE 2.0, RoBERTa, and other more recent models. So if we really want to unify the text matching tasks, the current situation is that we cannot do without large-scale pre-trained models.

Of course, if you insist on using a traditional matching model, at the very least ELMo can be bolted on to forcibly extend its life a little.

Open-source tools

Although text matching baselines are easy to build, building a complete system for a specific scenario is still a fair amount of work, and some easy-to-use open-source tools can greatly improve development efficiency.

MatchZoo: a generic text matching toolkit that includes a large number of representative datasets, matching models, and scenarios, with a friendly interface; very suitable for running baselines.
AnyQ: a framework for FAQ-style question answering. Its plug-in configuration mechanism makes it feel like assembling Lego, and it integrates a pile of representative matching models and retrieval models, completely covering the four essential parts of a QA system: Question Analysis, Retrieval, Matching, and Re-Rank.
DGU: a BERT-based universal dialogue understanding tool that provides simple yet effective solutions to dialogue tasks. Maxing out the SOTA on a pile of dialogue tasks (including multi-turn dialogue matching) with one key press is a magical experience.


References

  1. ^ 2015 TACL | From Paraphrase Database to Compositional Paraphrase Model and Back
  2. ^ Yang Y, Yih W, Meek C. WikiQA: A challenge dataset for open-domain question answering[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015: 2013-2018.
  3. ^ Lowe R, Pow N, Serban I, et al. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems[J]. arXiv preprint arXiv:1506.08909, 2015.
  4. ^ Wu Y, Wu W, Xing C, et al. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots[J]. arXiv preprint arXiv:1612.01627, 2016.
  5. ^ Bowman S R, Angeli G, Potts C, et al. A large annotated corpus for learning natural language inference[J]. arXiv preprint arXiv:1508.05326, 2015.

 
