2. Domain-specific knowledge graph fusion scheme: text matching algorithms with pre-trained SimBERT, the ERNIE-Gram single-tower model, and many other models [3]


Introduction to the knowledge graph column: data augmentation, intelligent labeling, text information extraction (entity/relation/event extraction), knowledge fusion algorithm schemes, knowledge reasoning, model optimization, model compression techniques, etc.



  • NLP and knowledge-graph-related technical and business implementation schemes with source code. This column will keep updating knowledge graph content (knowledge fusion, knowledge reasoning, etc.) and NLP business implementation schemes with source code.
  • At the same time, valuable material is sorted and summarized to save you time and help you quickly find what is useful for research or production.

Domain-specific knowledge graph fusion scheme: pre-trained models SimBERT and ERNIE-Gram for text matching

See the end of the article for the project link and source code.

Text matching is one of the most important basic tasks in natural language processing; it generally studies the relationship between two pieces of text. It has many application scenarios, such as information retrieval, question answering systems, intelligent dialogue, text identification, intelligent recommendation, text deduplication, text similarity calculation, and natural language inference, yet text matching (and natural language processing in general) still faces many difficulties. Many NLP tasks can largely be abstracted as text matching problems: information retrieval can be reduced to matching search terms against document resources, question answering can be reduced to matching questions against candidate answers, and paraphrase detection can be reduced to matching two synonymous sentences.

0. Preface: Domain-Specific Knowledge Graph Fusion Solution

This project focuses on the domain-specific knowledge graph (Domain-specific Knowledge Graph, DKG) fusion scheme: text matching algorithms, academic knowledge fusion schemes, industry knowledge fusion landing schemes, and algorithm evaluation for KG production quality assurance. It surveys text matching algorithms from classic traditional models to the Siamese "twin-tower" neural networks, then to pre-trained models and joint supervised/unsupervised models, and also touches on the contrastive learning models of recent years; it then proposes techniques to improve text matching and finally gives a landing plan for DKG. The emphasis here is on principles and technical solutions. The project will gradually be open-sourced so we can build the KG together, aiming to walk through the complete pipeline from knowledge extraction to knowledge fusion, knowledge reasoning, and quality assessment.

0.1 Pre-reference projects


1. Domain-specific knowledge graph fusion scheme: prerequisite technical knowledge [1] - text matching algorithms

https://blog.csdn.net/sinat_39620217/article/details/128718537

2. Domain-specific knowledge graph fusion scheme: text matching algorithms SimNet, SimCSE, DiffCSE [2]

https://blog.csdn.net/sinat_39620217/article/details/128833057

3. Domain-specific knowledge graph fusion scheme: text matching algorithms with pre-trained SimBERT, the ERNIE-Gram single-tower model, and many other models [3]

https://blog.csdn.net/sinat_39620217/article/details/129026570

4. Domain-specific knowledge graph fusion scheme: applying what you have learned - verification on the question matching robustness evaluation competition [4]
https://blog.csdn.net/sinat_39620217/article/details/129026193

A collection of NLP and knowledge graph projects (information extraction, text classification, graph neural networks, performance optimization, etc.)

https://blog.csdn.net/sinat_39620217/article/details/128805154

2023 Top Conference in the Computer Field and ACL Natural Language Processing (NLP) Research Subdirection Summary

https://blog.csdn.net/sinat_39620217/article/details/128897539

0.2 Results at a glance

The evaluation results are as follows:

| Model | dev acc |
| --- | --- |
| SimCSE (unsupervised) | 58.97% |
| DiffCSE (unsupervised) | 63.23% |
| bert-base-chinese | 86.53% |
| bert-wwm-chinese | 86.33% |
| bert-wwm-ext-chinese | 86.05% |
| ernie-tiny | 86.07% |
| roberta-wwm-ext | 87.53% |
| rbt3 | 85.37% |
| rbtl3 | 85.17% |
| ERNIE-1.0-Base | 89.34% |
| ERNIE-Gram-Base-Pointwise | 90.58% |
  1. The SimCSE model is suitable for matching and retrieval scenarios that lack supervised data but have a large amount of unsupervised data.

  2. Compared with the SimCSE model, the DiffCSE model pays more attention to the differences between sentences and has precise vector representation ability. DiffCSE is likewise suitable for matching and retrieval scenarios that lack supervised data but have a large amount of unsupervised data.

  3. Among the supervised models, ERNIE-Gram clearly outperforms all the previous models.

1. SimBERT (UniLM)

Pre-trained models can be divided into three categories according to their training method or network structure:

  • The first is the auto-encoding (Auto-Encoding) language model represented by BERT [2]. An auto-encoding language model predicts the currently masked token from its context; representatives include BERT and Word2Vec (CBOW). It uses MLM as the pre-training task. Auto-encoding pre-trained models are usually better at discriminative tasks, i.e. Natural Language Understanding (NLU) tasks such as text classification and NER.

$$p(x)=\prod_{x \in \text{Mask}} p(x \mid \text{context})$$

Disadvantages: because [MASK] tokens are used during training, the pre-training and fine-tuning stages are inconsistent, and support for generation problems is poor.
Advantages: it encodes contextual semantic information very well and performs outstandingly on natural language understanding (NLU) downstream tasks.

  • The second is the auto-regressive (Auto-Regressive) language model represented by GPT [3]. An auto-regressive language model predicts the token at the current position from the tokens that appear before (or after) it; representative models include ELMo and GPT. It is generally pre-trained with a generation task, similar to writing an article, so autoregressive language models are better at generation tasks (Natural Language Generation, NLG), such as article generation.

$$\begin{aligned} & \text{forward: } p(x)=\prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right) \\ & \text{backward: } p(x)=\prod_{t=T}^{1} p\left(x_t \mid x_{>t}\right)\end{aligned}$$

Disadvantages: only unidirectional semantics can be used; left and right context cannot be exploited at the same time.
Advantages: friendly to natural language generation (NLG) tasks, matching the left-to-right process of generative tasks.

  • The third is the pre-trained model based on the encoder-decoder architecture, such as MASS [4], which encodes the input sentence into a feature vector with the encoder and then converts that feature vector into an output text sequence with the decoder. The advantage of Encoder-Decoder pre-training is that it combines the strengths of the auto-encoding and auto-regressive language models: a classification layer can be attached after the encoder for discriminative tasks, while using the encoder and decoder together enables generation tasks.

The Unified Language Model (UniLM) [1] introduced here has the same encoder structure as BERT from the point of view of network architecture. But judging from its pre-training tasks, it can not only be trained with masked context like an auto-encoding language model, but also be trained left-to-right like an auto-regressive language model; it can even act like an Encoder-Decoder model by first encoding the input text and then generating a sequence from left to right.

UniLM is a pre-trained language model proposed by Microsoft Research on top of BERT, called the Unified pre-trained Language Model. It uses three special masked pre-training objectives so that the model can be used for NLG while achieving BERT-level results on NLU tasks. It can perform unidirectional, sequence-to-sequence, and bidirectional prediction, so it can be said to combine the advantages of both AR and AE language models; UniLM achieved SOTA results on text summarization and generative question answering.

[1] Dong, Li, et al. “Unified language model pre-training for natural language understanding and generation.” Advances in Neural Information Processing Systems 32 (2019).

[2] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

[3] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training.

[4] Song, Kaitao, et al. "MASS: Masked sequence to sequence pre-training for language generation." arXiv preprint arXiv:1905.02450 (2019).

1.1 Detailed explanation of UniLM model

Original paper: Unified Language Model Pre-training for Natural Language Understanding and Generation

The three types of pre-training architectures introduced above usually require different pre-training tasks. But these tasks can all be summarized as predicting unknown content from known content; the difference lies in which content is known and which needs to be predicted. The core of UniLM is to unify the tasks used to train these different architectures into a masked-language-model-like framework, and then adapt to different tasks through a variable mask matrix M (Mask Matrix).
All the core content of UniLM can be summarized in the figure below.

Figure 1: UniLM's network structure and its different pre-training tasks

The model framework is shown in the figure above. In the pre-training stage, UniLM jointly learns a single Transformer network with three language models that have different objective functions (a bidirectional language model, a unidirectional language model, and a sequence-to-sequence language model). Different self-attention masks are used to control the context visible to each token being predicted; that is, different masks control how many context words a predicted word can see, which yields the different model behaviors.

1.1.1 Model input

First, UniLM segments the input sentence with WordPiece. Besides the token embedding produced by segmentation, UniLM adds a position embedding (in the same way as BERT) and a segment embedding used to distinguish the two segments of a text pair. To obtain a feature vector for the whole sentence, UniLM adds the [SOS] token at the beginning of the sentence, and it adds [EOS] tokens to separate the segments. For a concrete example, refer to the content in the blue dashed box in the figure. The input thus consists of token, position, and segment embeddings; the segment embedding also serves as an indicator of which training mode (unidirectional, bidirectional, or sequence-to-sequence) the model is using.
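As a concrete illustration, here is a minimal plain-Python sketch of this packing step; the tokenizer is skipped and the helper below is an assumption for illustration, not part of UniLM's code.

# Minimal sketch: packing a text pair into UniLM's input format.
# Tokens here are toy placeholders, not real WordPiece output.
def build_unilm_input(segment_a, segment_b):
    tokens = ["[SOS]"] + segment_a + ["[EOS]"] + segment_b + ["[EOS]"]
    # segment id 0 covers [SOS], the first segment and its [EOS];
    # segment id 1 covers the second segment and the final [EOS]
    segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, seg, pos = build_unilm_input(["你", "想", "吃", "啥"], ["白", "切", "鸡"])
print(tokens)  # ['[SOS]', '你', '想', '吃', '啥', '[EOS]', '白', '切', '鸡', '[EOS]']
print(seg)     # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]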

1.1.2 Network structure

As shown in the red dashed box in Figure 1, UniLM uses an $L$-layer Transformer. To let the different pre-training tasks share this one network, UniLM adds a mask matrix operator to it. Specifically, suppose the input text is $\left\{\boldsymbol{x}_i\right\}_{i=1}^{|x|}$; the input of the first layer is $\boldsymbol{H}^0=\left[\boldsymbol{x}_1, \cdots, \boldsymbol{x}_{|x|}\right]$, and after the $L$ Transformer layers the features are $\boldsymbol{H}^l=\text{Transformer}\left(\boldsymbol{H}^{l-1}\right), l \in[1, L]$, where $\boldsymbol{H}^l=\left[\boldsymbol{h}_1^{l}, \ldots, \boldsymbol{h}_{|x|}^{l}\right]$ is the contextual representation at layer $l$. In each Transformer block, multiple self-attention heads aggregate the output vectors of the previous layer; for layer $l$, the output of a self-attention head is $\boldsymbol{A}_l$. Unlike the original Transformer, UniLM adds a mask matrix: taking layer $l$ as an example, the attention takes the form shown in formulas (1) to (3).

$$\begin{gathered}\boldsymbol{Q}_l=\boldsymbol{H}^{l-1} \boldsymbol{W}_l^Q \quad \boldsymbol{K}_l=\boldsymbol{H}^{l-1} \boldsymbol{W}_l^K \quad \boldsymbol{V}_l=\boldsymbol{H}^{l-1} \boldsymbol{W}_l^V \\ \boldsymbol{M}_{ij}= \begin{cases}0, & \text{allow to attend} \\ -\infty, & \text{prevent from attending}\end{cases} \\ \boldsymbol{A}_l=\operatorname{softmax}\left(\frac{\boldsymbol{Q}_l \boldsymbol{K}_l^{\top}}{\sqrt{d_k}}+\boldsymbol{M}\right) \boldsymbol{V}_l\end{gathered}$$

where $\boldsymbol{H}^{l-1} \in \mathbb{R}^{|x| \times d_h}$ is linearly projected into the Query, Key, and Value triplet by the parameter matrices $\boldsymbol{W}_l^Q, \boldsymbol{W}_l^K, \boldsymbol{W}_l^V$, and $\boldsymbol{M} \in \mathbb{R}^{|x| \times |x|}$ is the mask matrix mentioned above that controls the pre-training task. The mask matrix $\boldsymbol{M}$ determines whether a pair of tokens may attend to each other; by masking out attention connections, each prediction can only rely on the features relevant to its specific task, which realizes the different pre-training methods.
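To make formulas (1)-(3) concrete, here is a minimal NumPy sketch of one masked self-attention head; the dimensions and random weights are toy values for illustration, not UniLM's actual parameters.

import numpy as np

def masked_self_attention(H, Wq, Wk, Wv, M):
    # One attention head: softmax(Q K^T / sqrt(d_k) + M) V
    # H: (seq_len, d_h) output of the previous layer
    # M: (seq_len, seq_len) mask matrix, 0 where attention is allowed,
    #    a very negative number where it is prevented (stands in for -inf)
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# toy example: 4 tokens, hidden size 8, head size 4
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
M = np.zeros((4, 4))  # an all-zero mask corresponds to bidirectional attention
print(masked_self_attention(H, Wq, Wk, Wv, M).shape)  # (4, 4)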

1.1.3 Unification of tasks

UniLM has a total of four pre-training tasks. Besides the three language models shown in Figure 1, there is also the classic NSP task. They are introduced separately below.

  • Bidirectional language model :

    • A MASK cloze task whose input is a text pair $[SOS, x_1, x_2, \text{MASK}, x_4, EOS, x_5, \text{MASK}, x_7, EOS]$.
    • The bidirectional language model is the top task in Figure 1. Like the masked language model, it uses the context to predict the masked part, which is consistent with BERT: when predicting a masked token, all tokens can be observed. As shown in the figure, an all-zero matrix is used as the mask matrix; the model attends to the full context, so $M$ is a zero matrix.
  • One-way language model :

    • A MASK cloze task whose input is a single text $[x_1, x_2, \text{MASK}, x_4]$.
    • The unidirectional language model can run left-to-right or right-to-left. The example in Figure 1 is left-to-right, which is the masking scheme used in GPT [3]. In this setting, when the model predicts the token at time step $t$ it can only see the tokens before step $t$, so $M$ is a matrix whose strict upper triangle is filled with $-\infty$ (the shaded part of the second mask matrix in Figure 1). Conversely, for a right-to-left unidirectional language model, $M$ is a lower-triangular mask. In other words, left-to-right prediction uses only the text to the left of the masked token, and right-to-left prediction uses only the text to its right; in the figure the shaded part of the triangular mask is $-\infty$ and the blank part is 0.
  • Seq-to-Seq language model :

    • A MASK cloze task whose input is a text pair $[SOS, x_1, x_2, \text{MASK}, x_4, EOS, x_5, \text{MASK}, x_7, EOS]$.
    • If the masked token is in the first text sequence, only the tokens of the first sequence can be used, and no information from the second sequence is visible; if the masked token is in the second sequence, it is predicted from all tokens of the first sequence plus the tokens to its left in the second sequence.
    • During training, a sequence is formed as [SOS]S_1[EOS]S_2[EOS], where S1 is the source segment and S2 the target segment. Words in both segments are randomly masked. A masked word in the source segment can attend to all tokens of the source segment; a masked word in the target segment can attend to all source tokens plus the current word and the tokens to its left in the target segment. In this way the model implicitly learns a bidirectional encoder and a unidirectional decoder (similar to a Transformer encoder-decoder).

In Seq-to-Seq tasks such as machine translation, we usually first encode the input sentence into a feature vector with an encoder and then decode that vector into the predicted output with a decoder. UniLM's structure is very different from the traditional Encoder-Decoder model: it consists only of a multi-layer Transformer. During pre-training, UniLM first concatenates the two sentences into one sequence separated by [EOS], expressed as [SOS]S1[EOS]S2[EOS]. When encoding, the complete content of the input sentence must be visible, so the input text is not masked from itself; when decoding, the decoder part becomes a left-to-right unidirectional language model. Therefore the block corresponding to the first segment (S1) attending to itself is a zero matrix (upper-left block), the block for S1 attending to S2 is fully masked, and the block for the second segment (S2) attending to itself is a triangular causal mask, which gives the bottom mask matrix $M$ in Figure 1. Thus, although UniLM adopts an encoder architecture, when training the Seq-to-Seq language model it can still attend to all input features and to the already generated output features, just like a classic Encoder-Decoder. (A small sketch of these three mask matrices follows the task list below.)

  • NSP: like BERT, UniLM also adds next sentence prediction as a pre-training task for the bidirectional language model: if the second text follows the first one, predict 1; otherwise predict 0.
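As referenced above, the sketch below shows one way to construct the three mask matrices with NumPy (a large negative number stands in for $-\infty$). It only illustrates the masking logic described in this section and is not the original UniLM implementation.

import numpy as np

NEG_INF = -1e9  # stands in for -inf in the additive attention mask

def bidirectional_mask(n):
    # every token can attend to every token (BERT-style MLM)
    return np.zeros((n, n))

def left_to_right_mask(n):
    # token t attends only to tokens <= t (GPT-style); strict upper triangle is blocked
    return np.triu(np.full((n, n), NEG_INF), k=1)

def seq2seq_mask(len_s1, len_s2):
    # S1 is visible to everyone; inside S2 each token sees only itself and the tokens to its left
    n = len_s1 + len_s2
    M = np.full((n, n), NEG_INF)
    M[:, :len_s1] = 0.0                               # all rows may attend to S1
    M[len_s1:, len_s1:] = left_to_right_mask(len_s2)  # causal mask inside S2
    return M

print(seq2seq_mask(3, 2))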

1.1.4 Training and fine-tuning

Training: during training, 1/3 of the time is used to train the bidirectional language model and 1/3 to train the unidirectional language model (half of that left-to-right and half right-to-left); the final 1/3 is used to train the Encoder-Decoder (sequence-to-sequence) objective.

Fine-tuning: for NLU tasks, UniLM can be used directly as an encoder: the feature vector of the whole sentence is taken from the [SOS] token, and a classification layer added on top of it produces the predicted category. For NLG tasks, sentences are concatenated into a sequence "[SOS]S1[EOS]S2[EOS]" as described above, where S1 is the entire input text. During fine-tuning, parts of the target sentence S2 are randomly masked, and the [EOS] of the target sentence can also be masked, so that the model learns when to predict [EOS] and stop generating, rather than generating to a preset length.

  • Network settings: 24-layer Transformer, 1024 hidden sizes, 16 attention heads
  • Parameter size: 340M
  • Initialization: directly adopt Bert-Large parameter initialization
  • Activation function: GELU, same as bert
  • dropout ratio: 0.1
  • Weight decay factor: 0.01
  • batch_size:330
  • Mixed training: within a batch, 1/3 of the time uses the bidirectional LM objective, 1/3 uses the Seq2Seq LM objective, and the last 1/3 is evenly split between the two unidirectional LMs, i.e. left-to-right and right-to-left each take 1/6 of the time
  • MASK strategy: the overall masking ratio is 15%; among masked positions, 80% are replaced with [MASK], 10% are replaced with a random word, and 10% keep the real token. In addition, 80% of the time a single word is masked, and the other 20% of the time a bi-gram or tri-gram span is masked (see the sketch below)
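The sketch below illustrates this masking recipe in a simplified form (plain Python; the helper name and toy vocabulary are assumptions for illustration, and real preprocessing handles spans and special tokens more carefully).

import random

def unilm_mask_tokens(tokens, vocab, mask_rate=0.15):
    # ~15% of positions are corrupted: 80% -> [MASK], 10% -> random word, 10% -> keep.
    # With 20% probability a bi-gram or tri-gram span is masked instead of a single token.
    tokens = list(tokens)
    i = 0
    while i < len(tokens):
        if random.random() < mask_rate:
            span = 1 if random.random() < 0.8 else random.choice([2, 3])
            for j in range(i, min(i + span, len(tokens))):
                r = random.random()
                if r < 0.8:
                    tokens[j] = "[MASK]"
                elif r < 0.9:
                    tokens[j] = random.choice(vocab)
                # else: keep the original token
            i += span
        else:
            i += 1
    return tokens

print(unilm_mask_tokens(list("今天天气真不错"), vocab=list("的了是在有")))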

1.1.5 Summary

UniLM and Encoder-Decoder architectures such as MASS both try to unify NLU and NLG tasks, but UniLM's architecture is undoubtedly more elegant: when MASS performs NLU tasks it only uses the encoder part of the model, discarding all the features learned by the decoder. One problem with UniLM is that on classic Seq-to-Seq tasks such as machine translation, its masking mechanism means the generated part does not use the whole-sentence feature at the [SOS] token but attends to the input token sequence directly, which may weaken the capture of sentence-level features and the control of global information in the generated content. Furthermore, UniLM outperformed previous state-of-the-art models on five NLG datasets: CNN/DailyMail and Gigaword text summarization, SQuAD question generation, CoQA generative question answering, and DSTC7 dialogue generation. Its advantages can be summarized as follows:

  • Three different training objectives share one set of network parameters
  • Parameter sharing keeps the model from overfitting to a single language-model objective, making the learned model more general
  • The Seq2Seq language model objective lets it handle NLG tasks while still handling NLU tasks

1.2 SimBERT

1.2.1 SimBERT Model Fusion Retrieval and Generation

SimBERT is a BERT model that, based on the UniLM idea, fuses retrieval and generation.

Weight download: https://github.com/ZhuiyiTechnology/pretrained-models

The core of UniLM is to give the model Seq2Seq ability through a special Attention Mask. If the input is "what do you want to eat" and the target sentence is "white cut chicken", UniLM concatenates the two sentences into one: [CLS] what do you want to eat [SEP] white cut chicken [SEP], and then applies the Attention Mask shown in the figure:

In other words, the tokens of "[CLS] what do you want to eat [SEP]" attend to each other bidirectionally, while the tokens of "white cut chicken [SEP]" attend unidirectionally, allowing the model to recursively predict the tokens of "white cut chicken [SEP]"; this gives it the ability to generate text.

Diagram of UniLM as a Seq2Seq model: the input part uses bidirectional attention, while the output part uses only unidirectional attention.

Seq2Seq only shows that UniLM has NLG ability, so why did we say it has both NLU and NLG ability? Because of UniLM's special Attention Mask, the 6 tokens of "[CLS] what do you want to eat [SEP]" only attend among themselves and have nothing to do with "white cut chicken [SEP]". That is, although "white cut chicken [SEP]" is appended afterwards, it does not affect the encoded vectors of the first 6 tokens: those 6 vectors are identical to the encoding obtained when the input is only "[CLS] what do you want to eat [SEP]". So if the [CLS] vector is taken as a sentence vector, it is the sentence vector of "what do you want to eat", not of the sequence with "white cut chicken" appended.

Thanks to this property, UniLM also randomly inserts some [MASK] tokens into the input, so that the input part performs an MLM task while the output part performs a Seq2Seq task. The MLM strengthens NLU ability and the Seq2Seq strengthens NLG ability: two birds with one stone.

1.2.2 SimBERT

SimBERT is trained with supervision; the training corpus consists of similar sentence pairs collected by the authors. The Seq2Seq part is built as a similar-sentence generation task (predicting one sentence of a pair from the other). Meanwhile, the [CLS] vector mentioned earlier actually represents the input sentence vector, so it can simultaneously be used to train a retrieval task, as described below.

Assuming SENT_a and SENT_b are a pair of similar sentences, both [CLS] SENT_a [SEP] SENT_b [SEP] and [CLS] SENT_b [SEP] SENT_a [SEP] are added to the same batch for the similar-sentence generation task; this is the Seq2Seq part.

On the other hand, the [CLS] vectors of the whole batch are collected into a b×d sentence-vector matrix V (b is batch_size, d is hidden_size), which is l2-normalized along the d dimension to obtain a new V. Pairwise inner products then give a b×b similarity matrix VV^T, which is multiplied by a scale (30 was used), has its diagonal masked out, and is finally passed through a row-wise softmax and trained as a classification task in which the target label of each sample is its similar sentence (the sample itself having been masked). Put plainly, all non-similar samples in the batch are treated as negatives, and softmax raises the similarity of similar pairs while lowering the similarity to everything else.
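The NumPy sketch below illustrates this in-batch retrieval objective as just described (l2 normalization, scale 30, diagonal masking, row-wise softmax); the pairing convention and the function name are assumptions made for illustration, not SimBERT's original code.

import numpy as np

def simbert_retrieval_loss(cls_vectors, scale=30.0):
    # cls_vectors: (b, d) [CLS] vectors; samples 2k and 2k+1 are assumed to be a
    # similar-sentence pair, so each sample's target is its partner in the batch.
    V = cls_vectors / np.linalg.norm(cls_vectors, axis=1, keepdims=True)  # l2-normalize
    S = V @ V.T * scale                                                   # (b, b) scaled similarities
    np.fill_diagonal(S, -1e9)                                             # mask self-similarity
    b = S.shape[0]
    labels = np.arange(b) ^ 1                     # partner index: 0<->1, 2<->3, ...
    S = S - S.max(axis=1, keepdims=True)          # row-wise softmax cross-entropy
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(b), labels].mean()

print(simbert_retrieval_loss(np.random.default_rng(0).normal(size=(8, 16))))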

For detailed introduction, please see: https://kexue.fm/archives/7427

Some results show:

>>> gen_synonyms(u'微信和支付宝哪个好?')

[
    u'微信和支付宝,哪个好?',
    u'微信和支付宝哪个好',
    u'支付宝和微信哪个好',
    u'支付宝和微信哪个好啊',
    u'微信和支付宝那个好用?',
    u'微信和支付宝哪个好用',
    u'支付宝和微信那个更好',
    u'支付宝和微信哪个好用',
    u'微信和支付宝用起来哪个好?',
    u'微信和支付宝选哪个好',
    u'微信好还是支付宝比较用',
    u'微信与支付宝哪个',
    u'支付宝和微信哪个好用一点?',
    u'支付宝好还是微信',
    u'微信支付宝究竟哪个好',
    u'支付宝和微信哪个实用性更好',
    u'好,支付宝和微信哪个更安全?',
    u'微信支付宝哪个好用?有什么区别',
    u'微信和支付宝有什么区别?谁比较好用',
    u'支付宝和微信哪个好玩'
]

>>> most_similar(u'怎么开初婚未育证明', 20)
[
    (u'开初婚未育证明怎么弄?', 0.9728098), 
    (u'初婚未育情况证明怎么开?', 0.9612292), 
    (u'到哪里开初婚未育证明?', 0.94987774), 
    (u'初婚未育证明在哪里开?', 0.9476072), 
    (u'男方也要开初婚证明吗?', 0.7712214), 
    (u'初婚证明除了村里开,单位可以开吗?', 0.63224965), 
    (u'生孩子怎么发', 0.40672967), 
    (u'是需要您到当地公安局开具变更证明的', 0.39978087), 
    (u'淘宝开店认证未通过怎么办', 0.39477515), 
    (u'您好,是需要当地公安局开具的变更证明的', 0.39288986), 
    (u'没有工作证明,怎么办信用卡', 0.37745982), 
    (u'未成年小孩还没办身份证怎么买高铁车票', 0.36504325), 
    (u'烟草证不给办,应该怎么办呢?', 0.35596085), 
    (u'怎么生孩子', 0.3493368), 
    (u'怎么开福利彩票站', 0.34158638), 
    (u'沈阳烟草证怎么办?好办不?', 0.33718678), 
    (u'男性不孕不育有哪些特征', 0.33530876), 
    (u'结婚证丢了一本怎么办离婚', 0.33166665), 
    (u'怎样到地税局开发票?', 0.33079252), 
    (u'男性不孕不育检查要注意什么?', 0.3274408)
]

1.2.3 SimBERT training and prediction

SimBERT's weights are based on Google's open-source BERT model; following Microsoft's UniLM idea, a task fusing retrieval and generation was designed for further fine-tuning, so the resulting model has both similar-query generation and similar-sentence retrieval capabilities.

The dataset is LCQMC; for reference: https://aistudio.baidu.com/aistudio/projectdetail/5423713?contributionType=1

# Data preparation: use the built-in PaddleNLP dataset
from paddlenlp.datasets import load_dataset
train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"])

# Save the dataset to a file and inspect it
import json
with open("/home/aistudio/output/test.txt", "w+", encoding='UTF-8') as f:    # w+: create the file if it does not exist, truncate it, then write
    for result in dev_ds:
        line = json.dumps(result, ensure_ascii=False)  # json defaults to ASCII-escaping Chinese; set ensure_ascii=False to write real Chinese characters
        f.write(line + "\n")
# The data can also be read from an uploaded copy instead of the built-in loader; choose whichever you prefer

Partial display of the data set to be predicted:

开初婚未育证明怎么弄?	初婚未育情况证明怎么开?	1
谁知道她是网络美女吗?	爱情这杯酒谁喝都会醉是什么歌	0
人和畜生的区别是什么?	人与畜生的区别是什么!	1
男孩喝女孩的尿的故事	怎样才知道是生男孩还是女孩	0
这种图片是用什么软件制作的?	这种图片制作是用什么软件呢?	1
这腰带是什么牌子	护腰带什么牌子好	0
什么牌子的空调最好!	什么牌子的空调扇最好	0

Note the data format here: the file to be predicted is unlabeled.

开初婚未育证明怎么弄?	初婚未育情况证明怎么开?	
谁知道她是网络美女吗?	爱情这杯酒谁喝都会醉是什么歌	
人和畜生的区别是什么?	人与畜生的区别是什么!	
男孩喝女孩的尿的故事	怎样才知道是生男孩还是女孩	
这种图片是用什么软件制作的?	这种图片制作是用什么软件呢?	
这腰带是什么牌子	护腰带什么牌子好	
什么牌子的空调最好!	什么牌子的空调扇最好	
# Model prediction
# %cd SimBERT
!export CUDA_VISIBLE_DEVICES=0
!python predict.py --input_file /home/aistudio/LCQMC/dev.txt

Run predict.py to obtain the similarities; partial output:

{'query': '开初婚未育证明怎么弄?', 'title': '初婚未育情况证明怎么开?', 'similarity': 0.9500292}
{'query': '谁知道她是网络美女吗?', 'title': '爱情这杯酒谁喝都会醉是什么歌', 'similarity': 0.24593769}
{'query': '人和畜生的区别是什么?', 'title': '人与畜生的区别是什么!', 'similarity': 0.9916624}
{'query': '男孩喝女孩的尿的故事', 'title': '怎样才知道是生男孩还是女孩', 'similarity': 0.3250241}
{'query': '这种图片是用什么软件制作的?', 'title': '这种图片制作是用什么软件呢?', 'similarity': 0.9774641}
{'query': '这腰带是什么牌子', 'title': '护腰带什么牌子好', 'similarity': 0.74771273}
{'query': '什么牌子的空调最好!', 'title': '什么牌子的空调扇最好', 'similarity': 0.83304036}

Using 0.9 as the similarity threshold, the resulting predictions, 1010100, are consistent with the gold labels.
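A minimal snippet of this thresholding step, applied to the similarity scores shown above:

# Turn the similarity scores above into 0/1 predictions with the 0.9 threshold.
scores = [0.9500292, 0.24593769, 0.9916624, 0.3250241, 0.9774641, 0.74771273, 0.83304036]
preds = [1 if s >= 0.9 else 0 for s in scores]
print("".join(map(str, preds)))  # 1010100, matching the gold labels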

2.Sentence Transformers (ERNIE/BERT/RoBERTa/Electra)

With the development of deep learning, the number of model parameters has increased rapidly. To train these parameters, larger datasets are required to avoid overfitting. However, for most NLP tasks, it is very difficult (and expensive) to construct large-scale labeled datasets, especially for syntactic and semantic related tasks. In contrast, the construction of large-scale unlabeled corpora is relatively easy. In order to utilize this data, we can first learn a good representation from it, and then apply these representations to other tasks. Recent studies have shown that pretrained models (Pretrained Models, PTM) based on large-scale unlabeled corpora have achieved good performance on NLP tasks.

In recent years, a large number of studies have shown that pretrained models (Pretrained Models, PTM) based on large corpora can learn general language representation, which is beneficial to downstream NLP tasks and can avoid training models from scratch. With the development of computing power, the emergence of deep models (ie Transformer) and the enhancement of training skills make PTM continue to develop, from shallow to deep.

Baidu's pre-trained model ERNIE, having been trained on massive data, is already a very good feature extractor. Borrowing the idea of transfer learning, we can use the semantic information it learned from massive data to assist tasks on small datasets (such as the medical text dataset in this example); fine-tuning a model such as ERNIE completes the text matching task.

To complete the text matching task with the pre-trained model ERNIE, the first idea that comes to mind is to concatenate the query and title texts, feed them into ERNIE, take the CLS feature (pooled_output), and pass it through a fully connected layer for binary classification. The figure below shows this usage of ERNIE for sentence-pair classification:
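A minimal sketch of this single-tower usage is shown below. It assumes the encoder behaves like the PaddleNLP ERNIE models used in this project, i.e. its forward returns (sequence_output, pooled_output); hidden_size must be set to match the chosen encoder, and this is an illustration rather than the exact library code.

import paddle.nn as nn

class PointwiseMatching(nn.Layer):
    # The concatenated (query, title) pair goes through the encoder;
    # the pooled [CLS] feature is classified into match / no-match.
    def __init__(self, pretrained_model, hidden_size=768, dropout=0.1):
        super().__init__()
        self.ptm = pretrained_model
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, input_ids, token_type_ids=None):
        _, pooled_output = self.ptm(input_ids, token_type_ids=token_type_ids)
        pooled_output = self.dropout(pooled_output)
        return self.classifier(pooled_output)  # (batch, 2) logits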

However, the problem with this usage is that ERNIE has a very large number of parameters, so the computation is heavy and the prediction speed is far from ideal, which cannot meet the requirements of online services. To solve this problem, you can use PaddleNLP to build a Sentence Transformer network.

Sentence Transformer adopts a Siamese (twin-tower) network structure. Query and Title are fed into ERNIE separately, sharing one set of ERNIE parameters, to obtain their respective token embeddings. Pooling is then applied to the token embeddings (this tutorial uses mean pooling), and the outputs are recorded as u and v. The three representations (u, v, |u-v|) are concatenated for binary classification. The network structure is shown in the figure above. Note that not only ERNIE but also models such as BERT, RoBERTa, and Electra can serve as the text semantic feature extractor.

Paper reference: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks https://arxiv.org/abs/1908.10084

So how does Sentence Transformer use Siamese's network structure to improve the prediction speed?

The advantage of the Siamese structure is that query and title are fed into the same network separately. In an information retrieval task, for example, the title texts in the database can be encoded in advance and their sequence_output features stored. When a user issues a query, only the query's sequence_output needs to be computed; combined with the stored title features, a simple mean pooling and fully connected layer then performs the binary classification. This greatly improves prediction efficiency while preserving model quality.

For the Siamese network structure commonly used in matching tasks, please refer to: https://blog.csdn.net/thriving_fcl/article/details/73730552
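Below is a minimal sketch of such a twin-tower network following the description above (one shared encoder, mean pooling, then concatenating u, v, and |u-v| for classification). The encoder interface (returning sequence_output, pooled_output) and hidden_size are assumptions that must match the actual pre-trained model; this is not the exact PaddleNLP implementation.

import paddle
import paddle.nn as nn

class SentenceTransformer(nn.Layer):
    def __init__(self, pretrained_model, hidden_size=768, dropout=0.1):
        super().__init__()
        self.ptm = pretrained_model              # one shared set of encoder parameters
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size * 3, 2)

    def mean_pool(self, sequence_output, attention_mask):
        # average token embeddings, ignoring padding positions
        mask = attention_mask.unsqueeze(-1).astype(sequence_output.dtype)
        return (sequence_output * mask).sum(axis=1) / mask.sum(axis=1).clip(min=1e-9)

    def forward(self, query_ids, title_ids, query_mask, title_mask):
        q_seq, _ = self.ptm(query_ids)
        t_seq, _ = self.ptm(title_ids)
        u = self.mean_pool(q_seq, query_mask)
        v = self.mean_pool(t_seq, title_mask)
        features = paddle.concat([u, v, paddle.abs(u - v)], axis=-1)
        return self.classifier(self.dropout(features))  # (batch, 2) logits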

2.1 Model Introduction

For the Chinese text matching problem, a series of models are open source:

  1. BERT (Bidirectional Encoder Representations from Transformers) Chinese model, abbreviated as bert-base-chinese, consists of a 12-layer Transformer network.
  2. ERNIE (Enhanced Representation through Knowledge Integration), supports ERNIE 1.0 Chinese model (abbreviated as ernie-1.0) and ERNIE Tiny Chinese model (abbreviated as ernie-tiny). Among them, ernie is composed of a 12-layer Transformer network, and ernie-tiny is composed of a 3-layer Transformer network.
  3. RoBERTa (A Robustly Optimized BERT Pretraining Approach), supports roberta-wwm-ext of the 12-layer Transformer network.

Evaluation of each model on the LCQMC dataset:

| Model | dev acc | test acc |
| --- | --- | --- |
| bert-base-chinese | 0.86537 | 0.84440 |
| bert-wwm-chinese | 0.86333 | 0.84128 |
| bert-wwm-ext-chinese | 0.86049 | 0.83848 |
| ernie-1.0 | 0.87480 | 0.84760 |
| ernie-tiny | 0.86071 | 0.83352 |
| roberta-wwm-ext | 0.87526 | 0.84904 |
| rbt3 | 0.85367 | 0.83464 |
| rbtl3 | 0.85174 | 0.83744 |

2.2 Model training

Taking the Chinese text matching public dataset LCQMC as the example dataset, you can run the following command to train the model on the training set (train.tsv) and evaluate it on the development set (dev.tsv).

Part of the training log:

global step 7010, epoch: 8, batch: 479, loss: 0.06888, accu: 0.97227, speed: 1.40 step/s
global step 7020, epoch: 8, batch: 489, loss: 0.08377, accu: 0.97617, speed: 6.30 step/s
global step 7030, epoch: 8, batch: 499, loss: 0.07471, accu: 0.97630, speed: 6.32 step/s
global step 7040, epoch: 8, batch: 509, loss: 0.05239, accu: 0.97559, speed: 6.32 step/s
global step 7050, epoch: 8, batch: 519, loss: 0.04824, accu: 0.97539, speed: 6.30 step/s
global step 7060, epoch: 8, batch: 529, loss: 0.05198, accu: 0.97617, speed: 6.42 step/s
global step 7070, epoch: 8, batch: 539, loss: 0.07196, accu: 0.97651, speed: 6.42 step/s
global step 7080, epoch: 8, batch: 549, loss: 0.07003, accu: 0.97646, speed: 6.36 step/s
global step 7090, epoch: 8, batch: 559, loss: 0.10023, accu: 0.97587, speed: 6.34 step/s
global step 7100, epoch: 8, batch: 569, loss: 0.04805, accu: 0.97641, speed: 6.08 step/s
eval loss: 0.46545, accu: 0.87264
[2023-02-07 17:31:29,933] [    INFO] - tokenizer config file saved in ./checkpoints_ernie/model_7100/tokenizer_config.json
[2023-02-07 17:31:29,933] [    INFO] - Special tokens file saved in ./checkpoints_ernie/model_7100/special_tokens_map.json

The pre-trained model used in the code example is ERNIE. If you want to use other pre-trained models such as BERT, RoBERTa, Electra, etc., just replace the model and tokenizer.

# Use an ERNIE pre-trained model
# ernie-3.0-medium-zh
model = AutoModel.from_pretrained('ernie-3.0-medium-zh')
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')

# ernie-1.0
# model = AutoModel.from_pretrained('ernie-1.0-base-zh')
# tokenizer = AutoTokenizer.from_pretrained('ernie-1.0-base-zh')

# ernie-tiny
# model = AutoModel.from_pretrained('ernie-tiny')
# tokenizer = AutoTokenizer.from_pretrained('ernie-tiny')


# Use a BERT pre-trained model
# bert-base-chinese
# model = AutoModel.from_pretrained('bert-base-chinese')
# tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

# bert-wwm-chinese
# model = AutoModel.from_pretrained('bert-wwm-chinese')
# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-chinese')

# bert-wwm-ext-chinese
# model = AutoModel.from_pretrained('bert-wwm-ext-chinese')
# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-ext-chinese')


# Use a RoBERTa pre-trained model
# roberta-wwm-ext
# model = AutoModel.from_pretrained('roberta-wwm-ext')
# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext')

# roberta-wwm-ext-large
# model = AutoModel.from_pretrained('roberta-wwm-ext-large')
# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext-large')

For more pre-trained models, refer to transformers

Training, evaluation, and testing are performed automatically while the program runs. The model is also automatically saved under the specified save_dir during training, for example:

checkpoints/
├── model_100
│   ├── model_config.json
│   ├── model_state.pdparams
│   ├── tokenizer_config.json
│   └── vocab.txt
└── ...

NOTE:

  • If you need to resume model training, set init_from_ckpt, e.g. init_from_ckpt=checkpoints/model_100/model_state.pdparams.
  • If you want to use the ernie-tiny model, install the sentencepiece dependency in advance, e.g. pip install sentencepiece.

# Model prediction
!export CUDA_VISIBLE_DEVICES=0
!python predict.py --device gpu --params_path /home/aistudio/Fine-tune/checkpoints_ernie/model_7100/model_state.pdparams

Output result:

Data: ['开初婚未育证明怎么弄?', '初婚未育情况证明怎么开?'] 	 Lable: similar
Data: ['谁知道她是网络美女吗?', '爱情这杯酒谁喝都会醉是什么歌'] 	 Lable: dissimilar
Data: ['人和畜生的区别是什么?', '人与畜生的区别是什么!'] 	 Lable: similar
Data: ['男孩喝女孩的尿的故事', '怎样才知道是生男孩还是女孩'] 	 Lable: dissimilar
Data: ['这种图片是用什么软件制作的?', '这种图片制作是用什么软件呢?'] 	 Lable: similar
Data: ['这腰带是什么牌子', '护腰带什么牌子好'] 	 Lable: dissimilar
Data: ['什么牌子的空调最好!', '什么牌子的空调扇最好'] 	 Lable: dissimilar

The predictions, 1010100, are consistent with SimBERT's results and with the true labels.

To modify the code, refer to the paddle.argmax API reference: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/argmax_cn.html#argmax

2.3 Summary

Both SimNet and Sentence Transformers are semantic matching models of the twin-tower Point-wise paradigm. The two schemes are computationally efficient and suitable for latency-sensitive applications that perform coarse ranking based on semantic similarity.

For more information about Sentence Transformers, refer to www.SBERT.net and the paper cited above.

3. Single tower text matching of pre-trained model ERNIE-Gram

Each sample in a text matching dataset usually consists of two texts (query, title). The label takes the form 0 or 1: 0 means the query does not match the title; 1 means it does.

  • Semantic matching model ernie_matching based on the single-tower Point-wise paradigm: high precision and high computational complexity, suitable for scenarios where semantic matching is used directly as a classification.
  • Semantic matching model ernie_matching based on the single-tower Pair-wise paradigm: high precision and high computational complexity, with a stronger ability to model the relative order of text similarity, suitable for scenarios where the similarity feeds an upper-level ranking module.
  • The semantic matching models based on the twin-tower Point-wise paradigm are more computationally efficient and suitable for latency-sensitive applications that perform coarse ranking based on semantic similarity.
  1. Pointwise: the input is two texts and a label; it can be treated as a classification problem, i.e. judging whether the two input texts match.
  2. Pairwise: the input is three texts, namely a Query with its corresponding positive and negative samples; this training method takes the relative order between texts into account (a minimal loss sketch follows the next list).

Single tower vs. twin towers

  • Single tower: the input texts are first concatenated and then fed into a single neural network model.

  • Twin towers: each input text is encoded into a fixed-length vector, and the relationship between the texts is computed from their representation vectors.
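As mentioned in the list above, here is a minimal sketch contrasting the two training paradigms at the loss level; the scores, margin value, and function names are illustrative assumptions, not the ernie_matching implementation.

import numpy as np

def pointwise_loss(logits, label):
    # Pointwise: cross-entropy on a single (query, title) pair.
    # logits: (2,) scores for [no-match, match]; label: 0 or 1.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def pairwise_loss(score_pos, score_neg, margin=0.1):
    # Pairwise: the positive title should outscore the negative title by `margin`.
    return max(0.0, margin - (score_pos - score_neg))

print(pointwise_loss(np.array([0.2, 1.3]), label=1))  # small loss: confident match
print(pairwise_loss(0.8, 0.3))                        # 0.0: ranking already satisfied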

This project uses the semantic matching dataset LCQMC as the training set and warm-starts training from the ERNIE-Gram pre-trained model; the resulting single-tower Point-wise semantic matching model is open-sourced, and users can directly use it for binary classification of semantic matching on text pairs.

Code Structure Description

ernie_matching/
├── deploy                   # deployment
|   └── python
|       └── predict.py       # Python inference/deployment example
├── export_model.py          # script for exporting dynamic-graph parameters to static-graph parameters
├── model.py                 # Point-wise & Pair-wise matching model definitions
├── data.py                  # conversion logic for Point-wise & Pair-wise training samples, and random negative sampling for Pair-wise
├── train_pointwise.py       # training script for the Point-wise single-tower matching model
├── train_pairwise.py        # training script for the Pair-wise single-tower matching model
├── predict_pointwise.py     # prediction script for the Point-wise single-tower model, outputs whether a text pair is similar: 0/1
├── predict_pairwise.py      # prediction script for the Pair-wise single-tower model, outputs a similarity score for the text pair
└── train.py                 # model training and evaluation

Introduction to the dataset:

LCQMC is a Chinese question matching dataset from the Baidu Knows (Baidu Zhidao) domain; its purpose is to address the lack of large-scale question matching datasets for Chinese. The dataset is extracted and constructed from user questions in different domains on Baidu Knows.

3.1 Model Training and Prediction

Taking the Chinese text matching public dataset LCQMC as the example dataset, you can run the following command to train the single-tower Point-wise model on the training set (train.tsv) and evaluate it on the development set (dev.tsv).

%cd ERNIE_Gram
!unset CUDA_VISIBLE_DEVICES
!python -u -m paddle.distributed.launch --gpus "0" train_pointwise.py \
        --device gpu \
        --save_dir ./checkpoints \
        --batch_size 32 \
        --learning_rate 2E-5\
        --save_step 1000 \
        --eval_step 200 \
        --epochs 3


# save_dir: optional, directory for saving trained models; defaults to the checkpoints folder under the current directory.
# max_seq_length: optional, maximum sequence length used by the ERNIE-Gram model, at most 512; lower it if you run out of GPU memory; default 128.
# batch_size: optional, batch size; adjust according to GPU memory and lower it if memory runs out; default 32.
# learning_rate: optional, maximum learning rate for fine-tuning; default 5e-5.
# weight_decay: optional, strength of the regularization term used to prevent overfitting; default 0.0.
# epochs: number of training epochs; default 3.
# warmup_proportion: optional, proportion of the learning-rate warmup; with 0.1 the learning rate grows from 0 to learning_rate over the first 10% of training steps and then decays slowly; default 0.0.
# init_from_ckpt: optional, path to model parameters for warm-starting training; default None.
# seed: optional, random seed; default 1000.
# device: device used for training, cpu or gpu; when training on GPU, the gpus argument specifies the card id.

Part of the training log:

global step 3810, epoch: 1, batch: 3810, loss: 0.27187, accu: 0.90938, speed: 1.25 step/s
global step 3820, epoch: 1, batch: 3820, loss: 0.24648, accu: 0.92188, speed: 21.63 step/s
global step 3830, epoch: 1, batch: 3830, loss: 0.23190, accu: 0.92604, speed: 21.38 step/s
global step 3840, epoch: 1, batch: 3840, loss: 0.35609, accu: 0.91484, speed: 20.81 step/s
global step 3850, epoch: 1, batch: 3850, loss: 0.06531, accu: 0.91687, speed: 19.64 step/s
global step 3860, epoch: 1, batch: 3860, loss: 0.16462, accu: 0.91667, speed: 20.57 step/s
global step 3870, epoch: 1, batch: 3870, loss: 0.26173, accu: 0.91607, speed: 19.78 step/s
global step 3880, epoch: 1, batch: 3880, loss: 0.26429, accu: 0.91602, speed: 19.62 step/s
global step 3890, epoch: 1, batch: 3890, loss: 0.09031, accu: 0.91771, speed: 20.49 step/s
global step 3900, epoch: 1, batch: 3900, loss: 0.16542, accu: 0.91938, speed: 21.26 step/s
global step 3910, epoch: 1, batch: 3910, loss: 0.27632, accu: 0.92074, speed: 21.87 step/s
global step 3920, epoch: 1, batch: 3920, loss: 0.13577, accu: 0.92109, speed: 22.31 step/s
global step 3930, epoch: 1, batch: 3930, loss: 0.15333, accu: 0.91971, speed: 18.52 step/s
global step 3940, epoch: 1, batch: 3940, loss: 0.10362, accu: 0.92031, speed: 21.68 step/s
global step 3950, epoch: 1, batch: 3950, loss: 0.14692, accu: 0.92146, speed: 21.74 step/s
global step 3960, epoch: 1, batch: 3960, loss: 0.17472, accu: 0.92168, speed: 19.54 step/s
global step 3970, epoch: 1, batch: 3970, loss: 0.31994, accu: 0.91967, speed: 21.06 step/s
global step 3980, epoch: 1, batch: 3980, loss: 0.17073, accu: 0.91875, speed: 21.22 step/s
global step 3990, epoch: 1, batch: 3990, loss: 0.14955, accu: 0.91891, speed: 21.51 step/s
global step 4000, epoch: 1, batch: 4000, loss: 0.13987, accu: 0.91922, speed: 21.74 step/s
eval dev loss: 0.30795, accu: 0.87253

If you want to use other pre-trained models such as ERNIE, BERT, RoBERTa, Electra, etc., just replace the model and tokenizer.


# Use the ERNIE-3.0-medium-zh pre-trained model
model = AutoModel.from_pretrained('ernie-3.0-medium-zh')
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')



# Use the ERNIE-Gram pre-trained model
model = AutoModel.from_pretrained('ernie-gram-zh')
tokenizer = AutoTokenizer.from_pretrained('ernie-gram-zh')

# Use an ERNIE pre-trained model
# ernie-1.0
# model = AutoModel.from_pretrained('ernie-1.0-base-zh')
# tokenizer = AutoTokenizer.from_pretrained('ernie-1.0-base-zh')

# ernie-tiny
# model = AutoModel.from_pretrained('ernie-tiny')
# tokenizer = AutoTokenizer.from_pretrained('ernie-tiny')


# Use a BERT pre-trained model
# bert-base-chinese
# model = AutoModel.from_pretrained('bert-base-chinese')
# tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

# bert-wwm-chinese
# model = AutoModel.from_pretrained('bert-wwm-chinese')
# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-chinese')

# bert-wwm-ext-chinese
# model = AutoModel.from_pretrained('bert-wwm-ext-chinese')
# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-ext-chinese')


# Use a RoBERTa pre-trained model
# roberta-wwm-ext
# model = AutoModel.from_pretrained('roberta-wwm-ext')
# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext')

# roberta-wwm-ext-large
# model = AutoModel.from_pretrained('roberta-wwm-ext-large')
# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext-large')


NOTE:

  • If you need to resume model training, set init_from_ckpt, e.g. init_from_ckpt=checkpoints/model_100/model_state.pdparams.
  • If you want to use the ernie-tiny model, install the sentencepiece dependency in advance, e.g. pip install sentencepiece.

!unset CUDA_VISIBLE_DEVICES
!python -u -m paddle.distributed.launch --gpus "0" \
        predict_pointwise.py \
        --device gpu \
        --params_path "./checkpoints/model_4000/model_state.pdparams"\
        --batch_size 128 \
        --max_seq_length 64 \
        --input_file '/home/aistudio/LCQMC/test.tsv'

Partial prediction results:

{'query': '这张图是哪儿', 'title': '这张图谁有', 'pred_label': 0}
{'query': '这是什么水果?', 'title': '这是什么水果。怎么吃?', 'pred_label': 1}
{'query': '下巴长痘痘疼是什么原因', 'title': '下巴长痘痘是什么原因?', 'pred_label': 1}
{'query': '世界上最痛苦的是什么', 'title': '世界上最痛苦的是什么?', 'pred_label': 1}
{'query': '北京的市花是什么?', 'title': '北京的市花是什么花?', 'pred_label': 1}
{'query': '这个小男孩叫什么?', 'title': '什么的捡鱼的小男孩', 'pred_label': 0}
{'query': '蓝牙耳机什么牌子最好的?', 'title': '什么牌子的蓝牙耳机最好用', 'pred_label': 1}
{'query': '湖南卫视我们约会吧中间的歌曲是什么', 'title': '我们约会吧约会成功歌曲是什么', 'pred_label': 0}
{'query': '孕妇能吃驴肉吗', 'title': '孕妇可以吃驴肉吗?', 'pred_label': 1}
{'query': '什么鞋子比较好', 'title': '配什么鞋子比较好…', 'pred_label': 1}
{'query': '怎么把词典下载到手机上啊', 'title': '怎么把牛津高阶英汉双解词典下载到手机词典上啊', 'pred_label': 0}
{'query': '话费充值哪里便宜', 'title': '哪里充值(话费)最便宜?', 'pred_label': 1}
{'query': '怎样下载歌曲到手机', 'title': '怎么往手机上下载歌曲', 'pred_label': 1}
{'query': '苹果手机丢了如何找回?', 'title': '苹果手机掉了怎么找回', 'pred_label': 1}
{'query': '考试怎么考高分?', 'title': '考试如何考高分', 'pred_label': 1}
{'query': '带凶兆是什么意思', 'title': '主凶兆是什么意思', 'pred_label': 1}
{'query': '浅蓝色牛仔裤配什么颜色的帆布鞋好看啊', 'title': '浅蓝色牛仔裤配什么颜色外套和鞋子好看', 'pred_label': 0}
{'query': '怎么才能赚大钱', 'title': '怎么样去赚大钱呢', 'pred_label': 1}
{'query': '王冕是哪个朝代的', 'title': '王冕是哪个朝代的啊', 'pred_label': 1}
{'query': '世界上真的有僵尸吗?', 'title': '这个世界上真的有僵尸吗', 'pred_label': 1}
{'query': '梦见小女孩哭', 'title': '梦见小女孩对我笑。', 'pred_label': 0}
{'query': '这是神马电影?说什的?', 'title': '这是神马电影?!', 'pred_label': 1}
{'query': '李易峰快乐大本营饭拍', 'title': '看李易峰上快乐大本营吻戏', 'pred_label': 0}

3.2 Deployment prediction based on static graph

Model export

After training with the dynamic graph, you can use the static-graph export tool export_model.py to convert the dynamic-graph parameters into static-graph parameters. Run the following command:

!python export_model.py --params_path checkpoints/model_4000/model_state.pdparams --output_path=./output
# params_path is the path of the parameters saved during dynamic-graph training; output_path is the export path for the static-graph parameters.

# Prediction deployment
# After exporting the static-graph model, predictions can be made with it; deploy/python/predict.py provides a static-graph prediction example. Run the following command:
!python deploy/python/predict.py --model_dir ./output

Some results show:

Data: {'query': '〈我是特种兵之火凤凰〉好看吗', 'title': '特种兵之火凤凰好看吗?'} 	 Label: similar
Data: {'query': '现在看电影用什么软件好', 'title': '现在下电影一般用什么软件'} 	 Label: similar
Data: {'query': '什么水取之不尽用之不竭是什么生肖', 'title': '什么水取之不尽用之不竭打一生肖'} 	 Label: similar
Data: {'query': '愤怒的小鸟哪里下载', 'title': '愤怒的小鸟在哪里下载'} 	 Label: similar
Data: {'query': '中国象棋大师网', 'title': '中国象棋大师'} 	 Label: dissimilar
Data: {'query': '怎么注册谷歌账号?', 'title': '谷歌账号怎样注册'} 	 Label: similar
Data: {'query': '哪里可以看点金胜手', 'title': '点金胜手哪里能看完'} 	 Label: similar
Data: {'query': '什么牌子的行车记录仪好,怎么选', 'title': '行车记录仪什么牌子好;选哪个?'} 	 Label: similar
Data: {'query': '芭比公主系列总共有哪些QUQ', 'title': '芭比公主系列动漫有哪些'} 	 Label: dissimilar
Data: {'query': '新疆省会哪里', 'title': '新疆省会是哪里?'} 	 Label: similar
Data: {'query': '今天星期几!', 'title': '今天星期几呢'} 	 Label: similar
Data: {'query': '蜂蛹怎么吃', 'title': '蜂蛹怎么养'} 	 Label: dissimilar
Data: {'query': '少年老成是什么生肖', 'title': '什么生肖是少年老成'} 	 Label: similar
Data: {'query': '有关爱国的歌曲', 'title': '爱国歌曲有哪些'} 	 Label: similar

3.3 Summary

| Model | dev acc |
| --- | --- |
| SimCSE (unsupervised) | 58.97% |
| DiffCSE (unsupervised) | 63.23% |
| bert-base-chinese | 86.53% |
| bert-wwm-chinese | 86.33% |
| bert-wwm-ext-chinese | 86.05% |
| ernie-tiny | 86.07% |
| roberta-wwm-ext | 87.53% |
| rbt3 | 85.37% |
| rbtl3 | 85.17% |
| ERNIE-1.0-Base | 89.34% |
| ERNIE-Gram-Base-Pointwise | 90.58% |
  1. The SimCSE model is suitable for matching and retrieval scenarios that lack supervised data but have a large amount of unsupervised data.

  2. Compared with the SimCSE model, the DiffCSE model pays more attention to the differences between sentences and has precise vector representation ability. DiffCSE is likewise suitable for matching and retrieval scenarios that lack supervised data but have a large amount of unsupervised data.

  3. Among the supervised models, ERNIE-Gram clearly outperforms all the previous models.

Reference article: https://aistudio.baidu.com/aistudio/projectdetail/5423713?contributionType=1

4. Apply what you have learned – Thousand Words Question Matching Robustness Evaluation Competition Verification

Domain-specific knowledge graph fusion scheme: applying what you have learned - verification on the question matching robustness evaluation competition

That project mainly describes the practical application of text matching algorithms and introduces corresponding optimization schemes, such as interpretable learning. At the end it covers academic knowledge fusion schemes, industry knowledge fusion landing schemes, and algorithm evaluation for KG production quality assurance, involving contrastive learning and text matching.

https://blog.csdn.net/sinat_39620217/article/details/129026193

5. Domain-specific Knowledge Graph (DKG) fusion scheme (key part!)

With the technical background above, you can look at the practical business landing schemes and academic schemes that follow.

For knowledge fusion techniques based on graph neural networks, refer to the following link: PGL graph learning project collection & dataset sharing & technique summaries & business landing tips [Series 10]

It covers everything from introductory knowledge to classic and advanced graph algorithms; browse as needed.

Due to limited space, please refer to the column as needed: NLP knowledge graph related technical/business landing schemes and source code

5.1 Domain-specific knowledge graph knowledge fusion scheme (entity alignment): the Youku domain knowledge graph as an example

Scheme link: https://blog.csdn.net/sinat_39620217/article/details/128614951

5.2 Domain-specific knowledge graph knowledge fusion scheme (entity alignment): person entity alignment in building an entertainment knowledge graph

Scheme link: https://blog.csdn.net/sinat_39620217/article/details/128673963

5.3 Domain-specific knowledge graph knowledge fusion scheme (entity alignment): hands-on product knowledge graph techniques

Scheme link: https://blog.csdn.net/sinat_39620217/article/details/128674429

5.4 Domain-specific knowledge graph knowledge fusion scheme (entity alignment): exploring heterogeneous product entity representations based on graph neural networks

Scheme link: https://blog.csdn.net/sinat_39620217/article/details/128674929

5.5 Domain-specific knowledge graph knowledge fusion scheme (entity alignment): paper collection

Scheme link: https://blog.csdn.net/sinat_39620217/article/details/128675199

Paper resource links: the two collections have different contents and are ordered by number, with importance decreasing from smallest to largest

Knowledge graph entity alignment reference papers (PDF) + entity alignment schemes + domain-specific knowledge graph knowledge fusion schemes (entity alignment)

Knowledge graph entity alignment reference papers (CAJ) + entity alignment schemes + domain-specific knowledge graph knowledge fusion schemes (entity alignment)

5.6 Knowledge fusion algorithm testing scheme (knowledge production quality assurance)

Scheme link: https://blog.csdn.net/sinat_39620217/article/details/128675698

6. Summary

Text matching is one of the most important basic tasks in natural language processing; it generally studies the relationship between two pieces of text. It has many application scenarios, such as information retrieval, question answering systems, intelligent dialogue, text identification, intelligent recommendation, text deduplication, text similarity calculation, and natural language inference, yet text matching (and natural language processing in general) still faces many difficulties. Many NLP tasks can largely be abstracted as text matching problems: information retrieval can be reduced to matching search terms against document resources, question answering can be reduced to matching questions against candidate answers, and paraphrase detection can be reduced to matching two synonymous sentences.

This project focuses on the domain-specific knowledge graph (DKG) fusion scheme: text matching algorithms, academic knowledge fusion schemes, industry knowledge fusion landing schemes, and algorithm evaluation for KG production quality assurance. It surveys text matching algorithms from classic traditional models to the Siamese "twin-tower" neural networks, then to pre-trained models and joint supervised/unsupervised models, and also touches on the contrastive learning models of recent years; it then proposes techniques to improve text matching and finally gives a landing plan for DKG. The emphasis here is on principles and technical solutions; the project will gradually be open-sourced so we can build the KG together, aiming to walk through the complete pipeline from knowledge extraction to knowledge fusion, knowledge reasoning, and quality assessment.

| Model | dev acc |
| --- | --- |
| SimCSE (unsupervised) | 58.97% |
| DiffCSE (unsupervised) | 63.23% |
| bert-base-chinese | 86.53% |
| bert-wwm-chinese | 86.33% |
| bert-wwm-ext-chinese | 86.05% |
| ernie-tiny | 86.07% |
| roberta-wwm-ext | 87.53% |
| rbt3 | 85.37% |
| rbtl3 | 85.17% |
| ERNIE-1.0-Base | 89.34% |
| ERNIE-Gram-Base-Pointwise | 90.58% |
  1. The SimCSE model is suitable for matching and retrieval scenarios that lack supervised data but have a large amount of unsupervised data.

  2. Compared with the SimCSE model, the DiffCSE model pays more attention to the differences between sentences and has precise vector representation ability. DiffCSE is likewise suitable for matching and retrieval scenarios that lack supervised data but have a large amount of unsupervised data.

  3. Among the supervised models, ERNIE-Gram clearly outperforms all the previous models.

Project link:

Domain-specific knowledge graph fusion scheme: text matching algorithms - ERNIE-Gram single tower and many other models [3]:
https://aistudio.baidu.com/aistudio/projectdetail/5456683?contributionType=1&sUid=691158&shared=1&ts=1681821571224

Project reference links:

UniLM detailed explanation: https://zhuanlan.zhihu.com/p/584193190

Original paper: Unified Language Model Pre-training for Natural Language Understanding and Generation: https://arxiv.org/pdf/1905.03197.pdf

Detailed UniLM model: https://www.jianshu.com/p/22e3cc4842e1

Su Jianlin: SimBERT, a model fusing retrieval and generation: https://kexue.fm/archives/7427
