Are Noisy Sentences Useless for Distant Supervised Relation Extraction?

Paper information: AAAI 2020

Outline

Relation extraction aims to extract structured triples from unstructured text. For example, given the sentence "Barack Obama was born in the United States", the goal is to identify the "born in" relationship between the entities "Barack Obama" and "United States" and produce the triple (Barack Obama, born in, United States). A major problem for this task is the lack of large-scale manually annotated data, so distant supervision, which builds training data automatically by aligning text with the Freebase knowledge base (if a sentence contains both entities of a triple, it is assumed to express the corresponding relation), has gradually attracted extensive research.

However, the distant supervision assumption is too strong, so the resulting data is noisy: the relation labels of some sentences in the training set are wrong. At present there are two types of solutions:

  1. Soft methods: group the sentences containing the same entity pair into a sentence bag, then compute a weight for each sentence; noisy sentences are assigned lower weights so that they contribute less to the bag-level prediction, thereby reducing the effect of noise.
  2. Hard methods: directly identify the noisy sentences in a bag and remove them.

However, these methods ignore the real cause of the problem, namely the lack of correct labels. This paper therefore turns to correcting the labels of noisy sentences, which not only reduces the impact of noise but also increases the number of correctly labelled sentences. The proposed DCRE model consists of three main modules:

  • Sentence encoding module: encodes the text into vector representations
  • Noise detection module: selects noisy sentences by computing the degree of matching between each sentence and the bag-level label
  • Label generation module: assigns a credible relation label to each selected noisy sentence via deep clustering

Method

Problem Definition

Assume the relation set is \(\mathbb{R} = \{r_1, r_2, \dots, r_K\}\) and \(\mathbb{S}_b = \{s_1, s_2, \dots, s_B\}\) is the sentence bag for the entity pair \((e_1, e_2)\). The goal of relation extraction is to predict the relation \(r_i\) of \((e_1, e_2)\) based on \(\mathbb{S}_b\).

Text Encoding Module

The figure in the paper shows the text encoding module, which is also the most popular text encoder for entity relation extraction. Its general procedure is: given a sentence containing two entities, each word is first converted into a \(d_w\)-dimensional word vector and concatenated with two \(d_p\)-dimensional position vectors computed from the word's distances to the two entities, so each word vector has dimension \(d_s = d_w + 2d_p\); the word vectors are stacked into a matrix and a CNN extracts n-gram features; the feature sequence is split into three segments at the positions of the two entities and each segment is max-pooled; finally the pooled features are concatenated to give the encoding vector of the sentence.
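The steps above can be sketched in numpy. This is a minimal illustration of a piecewise-CNN encoder, not the paper's implementation: all names (`pcnn_encode`, the window size, tanh activation) are assumptions for the sketch.

```python
import numpy as np

def pcnn_encode(tokens, e1_pos, e2_pos, word_emb, pos_emb, conv_W, conv_b, window=3):
    """Piecewise CNN sentence encoder (illustrative sketch).

    tokens   : list of word ids, length n
    e1_pos   : index of the first entity; e2_pos: index of the second entity
    word_emb : (vocab, d_w) word embedding table
    pos_emb  : (2*max_len, d_p) relative-position embedding table
    conv_W   : (window * d_s, n_filters) convolution filters, d_s = d_w + 2*d_p
    """
    n = len(tokens)
    max_len = pos_emb.shape[0] // 2
    # 1. word vector = word embedding ++ two position embeddings
    rows = [np.concatenate([word_emb[w],
                            pos_emb[i - e1_pos + max_len],
                            pos_emb[i - e2_pos + max_len]])
            for i, w in enumerate(tokens)]
    X = np.stack(rows)                              # (n, d_s)
    d_s = X.shape[1]
    # 2. 1-D convolution over n-gram windows (symmetric zero padding)
    pad = np.zeros((window // 2, d_s))
    Xp = np.vstack([pad, X, pad])
    feats = np.stack([Xp[i:i + window].reshape(-1) @ conv_W + conv_b
                      for i in range(n)])           # (n, n_filters)
    # 3. piecewise max pooling over the three segments cut by the entities
    segs = np.split(feats, [e1_pos + 1, e2_pos + 1])
    pooled = [np.tanh(s.max(axis=0)) if len(s) else np.zeros(conv_W.shape[1])
              for s in segs]
    return np.concatenate(pooled)                   # (3 * n_filters,)
```

The output dimension is three times the number of filters, one max-pooled slice per segment, matching the piecewise pooling described above.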

Noise Detection Module

Assume the sentence vectors of a bag are \(H_b \in \mathbf{R}^{B \times d_s}\) and all relation vectors are \(L \in \mathbf{R}^{K \times d_s}\), where \(B\) is the number of sentences in the bag, \(d_s\) is the vector dimension, and \(K\) is the number of relations. The relevance between a sentence vector \(h_i\) and the bag label \(l_j\) is expressed by their dot product:
\[\alpha_i = h_i l_j^T\]
Next, the relevance scores of all sentences within a bag are normalized with a softmax:
\[\alpha_i = \frac{\exp(\alpha_i)}{\sum_{b=1}^{B} \exp(\alpha_b)}\]
\(\alpha_i\) represents the probability that the bag's relation label is correct for sentence \(i\). When \(\alpha_i\) is below a threshold \(\phi\), the sentence is regarded as a noisy sentence.

However, this alone cannot guarantee that a more relevant sentence is not mislabelled. The paper therefore adds some restrictions: within a sentence bag, the sentence with the maximum relevance score is regarded as a correct sentence, sentences whose relevance score is below the threshold are regarded as noisy sentences, and the sentences in between are ignored. Doing so has two advantages: 1) if the sentence with the maximum relevance score is indeed correct, selecting it as the positive instance of the bag is consistent with the expressed-at-least-one assumption; 2) if the sentence with the maximum relevance score is actually noisy, ignoring the other sentences is also a way of reducing noise. In either case the goal of noise reduction is achieved.
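The selection rule above can be written down directly. A minimal sketch, assuming a single bag and its label embedding; the function name `split_bag` and the default threshold are made up for illustration.

```python
import numpy as np

def split_bag(H, l, phi=0.1):
    """Partition a sentence bag by relevance to its bag-level label (sketch).

    H   : (B, d_s) sentence vectors of one bag
    l   : (d_s,)   embedding of the bag's relation label
    phi : threshold below which a sentence is treated as noise
    Returns (index of the most relevant sentence, indices of noisy sentences).
    """
    scores = H @ l                            # alpha_i = h_i . l_j
    e = np.exp(scores - scores.max())         # numerically stable softmax
    alpha = e / e.sum()
    best = int(alpha.argmax())                # kept as the bag's positive instance
    noisy = [i for i, a in enumerate(alpha) if a < phi and i != best]
    return best, noisy                        # sentences in between are ignored
```

Sentences that are neither the maximum nor below the threshold fall into neither returned set, matching the "ignored" middle group in the text.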

Label Generation Module

The goal of the label generation module is to reassign labels to the selected noisy sentences with high confidence, which is achieved through clustering. Suppose \(\mathbf{H} \in \mathbf{R}^{n \times d_s}\) is the matrix of all sentence representations and \(\mathbf{L} \in \mathbf{R}^{K \times d_s}\) is the matrix of all relation representations. The first step is to project the sentence vectors into the relation space:
\[\mathbf{C} = \mathbf{H}\mathbf{L}^T + \mathbf{b}\]
Then \(\mathbf{C}\) is clustered into \(n_c\) clusters \(\{\mathbf{\mu}_i\}_{i=1}^{n_c}\). The paper uses the Student's \(t\)-distribution as the similarity measure, so the similarity between \(\mathbf{c}_i\) and cluster centre \(\mathbf{\mu}_j\) is defined as:
\[q_{ij} = \frac{(1 + \|\mathbf{c}_i - \mathbf{\mu}_j\|^2)^{-1}}{\sum_j (1 + \|\mathbf{c}_i - \mathbf{\mu}_j\|^2)^{-1}}\]

\(q_{ij}\), the similarity between the projected sentence vector \(\mathbf{c}_i\) and cluster centre \(\mathbf{\mu}_j\), can be regarded as the probability of assigning relation label \(r_j\) to sentence \(s_i\).
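The soft-assignment formula is straightforward to compute with broadcasting; a minimal sketch (the name `soft_assign` is assumed):

```python
import numpy as np

def soft_assign(C, mu):
    """Student's t similarity between projected sentences and cluster centres.

    C  : (n, d)   projected sentence vectors
    mu : (n_c, d) cluster centres
    Returns Q with q_ij = similarity of sentence i to cluster j (rows sum to 1).
    """
    # squared distances ||c_i - mu_j||^2 for every pair, shape (n, n_c)
    d2 = ((C[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = 1.0 / (1.0 + d2)                  # t-distribution kernel, 1 degree of freedom
    return q / q.sum(axis=1, keepdims=True)
```

Each row of `Q` is a probability distribution over clusters, i.e. over candidate relation labels for that sentence.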

The clustering objective is defined using the KL divergence:
\[\mathcal{L} = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\]
where \(P\) is the target distribution. Because the relations in the NYT-10 dataset follow a long-tailed distribution, the paper follows the work of (Xie, Girshick, and Farhadi 2016) and defines \(P\) as:
\[p_{ij} = \frac{q_{ij}^2 / \sum_i q_{ij}}{\sum_j (q_{ij}^2 / \sum_i q_{ij})}\]
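Both the sharpened target distribution and the KL objective are a few lines of numpy; a sketch under the definitions above (function names assumed):

```python
import numpy as np

def target_distribution(Q):
    """Sharpened target P from Q, as in (Xie, Girshick, and Farhadi 2016).

    Squaring emphasises confident assignments; dividing by the per-cluster
    mass sum_i q_ij counteracts large clusters (the long-tail correction).
    """
    w = Q ** 2 / Q.sum(axis=0)                # q_ij^2 / sum_i q_ij
    return w / w.sum(axis=1, keepdims=True)   # renormalize each row

def kl_loss(P, Q):
    """Clustering objective KL(P || Q) = sum_ij p_ij log(p_ij / q_ij)."""
    return float((P * np.log(P / Q)).sum())
```

In the iterative training described by DEC-style methods, `P` is recomputed from the current `Q` and then held fixed while `Q` is pushed toward it.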

In addition, the paper only regenerates labels for sentences whose label is not NA, because the real relation of a sentence labelled NA (no relation) is often hard to know, and assigning it a new label could introduce more noise. Conversely, relabelling a non-NA sentence as NA amounts to removing a noisy sentence.

Scaled Loss Function

Because there is no explicit supervision for the noisy sentences, it is hard to guarantee that the clustering results are correct, i.e. the reassigned labels may also be wrong. The paper therefore sets the threshold \(\phi\) when selecting noisy sentences, and additionally uses \(q_{ij}\) as a weight in the cross-entropy loss, scaling it so that a label obtained by clustering influences the model in proportion to its confidence. The final cross-entropy loss function is defined as:
\[\begin{aligned} \mathcal{J}(\Theta) = & -\sum_{(x_i, y_i) \in \mathbb{V}} \log p(y_i | x_i; \Theta) \\ & - \lambda \sum_{(x_i, y_j) \in \mathbb{N}} q_{ij} \log p(y_j | x_i; \Theta) \end{aligned}\]
where \((x_i, y_i)\) is a training instance indicating that sentence \(x_i\) has relation label \(y_i\), and \(y_j \neq y_i\) is the new relation label assigned to \(x_i\). \(\mathbb{V}\) is the set of sentences with the maximum relevance score, \(\mathbb{N}\) is the set of selected noisy sentences, and \(\Theta\) denotes the model parameters.
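Given per-sentence log-probabilities from the classifier, the weighted loss reduces to two sums. A minimal sketch, not the paper's code; the function name and argument layout are assumptions:

```python
import numpy as np

def dcre_loss(logp_trusted, logp_noisy, q_weights, lam=0.5):
    """Weighted cross-entropy over trusted and relabelled sentences (sketch).

    logp_trusted : log p(y_i | x_i) for the most-relevant sentence of each bag
    logp_noisy   : log p(y_j | x_i) for noisy sentences under their new labels
    q_weights    : q_ij confidences from the clustering, aligned with logp_noisy
    lam          : trade-off hyper-parameter lambda
    """
    loss = -np.sum(logp_trusted)                              # first term over V
    loss -= lam * np.sum(np.asarray(q_weights) * np.asarray(logp_noisy))  # over N
    return float(loss)
```

A relabelled sentence with low clustering confidence \(q_{ij}\) thus contributes proportionally little gradient, which is exactly the scaling motivated above.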

Experiments

The paper is evaluated on the classic NYT-10 dataset, using the commonly used precision-recall curves as the evaluation metric; the experimental results are shown in the figure in the paper.


Origin www.cnblogs.com/weilonghu/p/12543162.html