Extraction methods in knowledge graph construction

These notes are written casually; read them as such.
To be continued.

Existing open-source knowledge graphs generally store their data as triples and also include ontology definitions at the conceptual level. A small knowledge graph can also be viewed as a property graph, which consists of nodes, edges, and the attributes of nodes, where nodes and edges each come in different types. Generally speaking, building a new knowledge graph requires defining:

  1. The set of entity types
  2. The set of relation types and their constraints (which types of node may be the start point and the end point)
  3. Entity-attribute name-attribute value triples
  4. Entity-relation-entity triples
  5. The set of attribute values (the rules that define them)
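
The five items above can be sketched as a small data structure. This is only an illustration under assumptions: the class names `Schema` and `KnowledgeGraph`, their fields, and the example types are hypothetical and do not come from any particular knowledge graph library.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    entity_types: set[str]                         # item 1
    # item 2: relation name -> (allowed head type, allowed tail type)
    relation_types: dict[str, tuple[str, str]]

@dataclass
class KnowledgeGraph:
    schema: Schema
    entities: dict[str, str] = field(default_factory=dict)                # name -> entity type
    attributes: list[tuple[str, str, str]] = field(default_factory=list)  # item 3
    relations: list[tuple[str, str, str]] = field(default_factory=list)   # item 4

    def add_entity(self, name, entity_type):
        if entity_type not in self.schema.entity_types:
            raise ValueError(f"unknown entity type: {entity_type}")
        self.entities[name] = entity_type

    def add_relation(self, head, relation, tail):
        # Enforce the start/end type constraint of the relation.
        head_type, tail_type = self.schema.relation_types[relation]
        if self.entities[head] != head_type or self.entities[tail] != tail_type:
            raise ValueError(f"{relation} must link {head_type} -> {tail_type}")
        self.relations.append((head, relation, tail))

schema = Schema({"Person", "Country"}, {"nationality": ("Person", "Country")})
kg = KnowledgeGraph(schema)
kg.add_entity("Trump", "Person")
kg.add_entity("United States", "Country")
kg.add_relation("Trump", "nationality", "United States")
kg.attributes.append(("Trump", "gender", "male"))
```

Item 5 (the rules for attribute values) is left out of the sketch; it would be a per-attribute validation step similar to the type check in `add_relation`.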

When constructing a knowledge graph, (1) the set of entity types and (2) the set of relation types and their constraints are generally defined in advance from fixed domain knowledge and from the requirements of how the graph will be used; this corresponds to the ontology (concept) layer in open-source knowledge graphs. The construction process itself is mainly concerned with (3) entity-attribute name-attribute value triples and (4) entity-relation-entity triples. Generally speaking, task 3 is handled the same way as task 4. There are obvious differences between the two, but they are not compared here.
The extraction tasks in knowledge graph construction can be divided into:

  1. Entity extraction
  2. Attribute extraction
  3. Relation extraction
  4. Entity attribute extraction
  5. Joint extraction of entities and relationships

1. Entity extraction

Entity extraction is also called named entity recognition. In terms of methods, there are:

  1. Use external dictionaries and external knowledge bases for recognition
  2. Rule-based method for identification
  3. Method based on statistical learning.

Open-domain named entity recognition is now a basic task, and a large number of tools achieve good results on it.
However, entity extraction in a specific domain often performs poorly. You can annotate enough data and train a sequence labeling model such as BERT or LSTM+CRF to extract entities, but building the data set is generally expensive. In practice, entities can usually be extracted quickly by combining external dictionaries or knowledge bases with hand-defined rules.
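
As a rough illustration of the dictionary-based approach, here is a minimal sketch of greedy longest-match lookup against a gazetteer. The function name `dictionary_ner` and the tiny gazetteer are assumptions made for the example; a real domain dictionary would be far larger.

```python
def dictionary_ner(text, gazetteer):
    """Greedy longest-match lookup; returns (start, end, surface, type) tuples."""
    max_len = max(len(name) for name in gazetteer)
    spans, i = [], 0
    while i < len(text):
        # Try the longest candidate first so "United States" beats any shorter entry.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in gazetteer:
                spans.append((i, i + length, candidate, gazetteer[candidate]))
                i += length
                break
        else:
            i += 1  # no dictionary entry starts at this position
    return spans

# Hypothetical gazetteer: surface form -> entity type.
gazetteer = {"Trump": "Person", "United States": "Country", "Washington": "City"}
entities = dictionary_ner(
    "Trump is the President of the United States, he lives in Washington", gazetteer
)
```

The matching is character-level, which suits unsegmented text such as Chinese; in languages written with spaces it can match inside a longer word, so a word-boundary check would be a sensible addition.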

2. Attribute value extraction

Compared with entity names, attribute values are more broadly defined, and there are certainly more types of attribute value than types of entity. It seems you could treat attribute values entirely like entities and extract them with sequence labeling, or identify a large share of them with rules (after all, attributes such as length and width are written in regular formats).
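
For attributes like length and width, a rule can be as simple as a regular expression. The sketch below is an assumption-laden example: the attribute names, the connective words, and the unit list in `ATTRIBUTE_PATTERN` are all hypothetical and would need to be adapted to the actual domain.

```python
import re

# Hypothetical rule: attribute name, optional connective, number, unit.
ATTRIBUTE_PATTERN = re.compile(
    r"(?P<name>length|width|height)\s*(?:is|of|:)?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm|km|m)\b",
    re.IGNORECASE,
)

def extract_attribute_values(text):
    """Return (attribute name, numeric value, unit) for every rule match."""
    return [
        (m["name"].lower(), float(m["value"]), m["unit"].lower())
        for m in ATTRIBUTE_PATTERN.finditer(text)
    ]

values = extract_attribute_values("The table has a length of 1.2 m and a width: 80 cm.")
# values == [("length", 1.2, "m"), ("width", 80.0, "cm")]
```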

3. Relation triple extraction

Following the training modes of machine learning, relation extraction methods can be roughly divided into rule-based, supervised, semi-supervised, unsupervised, distant supervision, and open-domain relation extraction.

Rule-based relation extraction

Rule-based relation extraction needs no training data (in that sense it is unsupervised) and relies mainly on domain experts to hand-craft rules, such as regular expressions, that extract relations between entities. A simple example is "Trump's nationality is the United States": we only need to define a rule in which a person entity is followed by "nationality is" and then by a country entity, and we can extract Trump - nationality - United States. Of course, real rules are more complicated than this, since the wording between the two entities cannot be enumerated exhaustively. But the essence of the rule-based method is to hand-craft rules that carry out the extraction.
In the extraction process, the data we have is in fact more than the raw string "Trump's nationality is the United States". In the example above, we already used the information that "Trump" is a person and "United States" is a country. We can also mine other information from the sentence; most commonly, syntactic and semantic information is used in relation extraction to make the defined rules more complete.
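
The nationality rule above can be sketched as follows. The entity sets `PERSONS` and `COUNTRIES` stand in for the output of an entity extraction step, and the regular expression is a hypothetical rule written for this one phrasing only.

```python
import re

# Stand-ins for entity typing; in practice these come from entity extraction.
PERSONS = {"Trump"}
COUNTRIES = {"United States"}

# Rule: <capitalized name>'s nationality is [the] <capitalized name(s)>
NATIONALITY_RULE = re.compile(
    r"(?P<head>[A-Z]\w+)(?:'|’)s nationality is (?:the )?"
    r"(?P<tail>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def extract_nationality(text):
    triples = []
    for m in NATIONALITY_RULE.finditer(text):
        head, tail = m["head"], m["tail"]
        # Type constraints: the head must be a person, the tail a country.
        if head in PERSONS and tail in COUNTRIES:
            triples.append((head, "nationality", tail))
    return triples

triples = extract_nationality("Trump's nationality is the United States.")
# triples == [("Trump", "nationality", "United States")]
```

The type check is what separates this from plain string matching: a sentence with the same wording but a head that is not a known person yields nothing.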

Supervised relation extraction

Supervised relation extraction is in fact relation classification, mainly done with statistical machine learning. The input is a sentence ("Trump's nationality is the United States"), two entities ("Trump" and "United States"), and a set of predefined relation types; the output is one relation from that set (here, "nationality"). The methods can be subdivided into three kinds: feature-vector-based, kernel-based, and deep-learning-based. Because the input and output of this problem are very clear, there are now a large number of neural network models for it. The best results appear to come from Linlin Wang's multi-level attention CNN (paper address). On this problem CNNs seem to do better than RNNs, even though intuitively an RNN should be better suited to this kind of sequence data. You can of course also use a pre-trained model to improve results: in my own quick test the original CNN model reached only 82, and it reached 86.9 after adding BERT (github link). Alibaba uses BERT and concatenates the vectors of the two entities with the sentence vector for classification, reaching an F1 of 89.25 (paper address).
The data set used in all of these papers is SemEval-2010 Task 8. It is relatively small, and it looks as if work on this data set keeps overfitting to it, so performance in real task scenarios is not guaranteed.
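
To make the input and output of relation classification concrete, here is a minimal sketch of how one training example might be prepared. Wrapping the two entities in marker symbols is one common way to tell a model where they are; the specific markers `$` and `#`, the function name `mark_entities`, and the relation list are assumptions for illustration, not a description of any particular paper's code.

```python
def mark_entities(sentence, head, tail, head_marker="$", tail_marker="#"):
    """Wrap the first occurrence of each entity in its marker symbol."""
    h, t = sentence.index(head), sentence.index(tail)
    spans = [(h, h + len(head), head_marker), (t, t + len(tail), tail_marker)]
    # Insert from right to left so earlier offsets stay valid.
    for start, end, marker in sorted(spans, reverse=True):
        sentence = f"{sentence[:start]}{marker} {sentence[start:end]} {marker}{sentence[end:]}"
    return sentence

# Hypothetical predefined relation set; the classifier outputs an index into it.
RELATION_TYPES = ["nationality", "birthplace", "employer"]

example = {
    "text": mark_entities("Trump's nationality is the United States", "Trump", "United States"),
    "label": RELATION_TYPES.index("nationality"),
}
# example["text"] == "$ Trump $'s nationality is the # United States #"
```

Any text classifier (CNN, RNN, or a pre-trained model) can then be trained on pairs of `text` and `label`.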

The main advantage of neural-network-based approaches is that they are end-to-end, with no need for feature extraction or syntactic analysis beforehand.

In real application scenarios, whether for a small or a large knowledge graph, it is difficult to predefine a complete set of candidate relations for the relation extraction task.

Semi-supervised relation extraction

Semi-supervised relation extraction mainly uses the idea of bootstrapping: it first manually constructs a small number of relation instances as a seed set, then expands the set through repeated iteration, using pattern learning or a trained model, until it finally obtains relation instances at sufficient scale.
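
The bootstrapping loop can be sketched with pattern learning as follows. This is a deliberately naive version under assumptions: a "pattern" is just the literal text between the two entities, the entity list is given, and the function name `bootstrap` and the example sentences are made up for the illustration.

```python
import re

def bootstrap(sentences, seeds, entities, rounds=2):
    """Grow a set of (head, tail) pairs from seed pairs by alternating two steps."""
    pairs, patterns = set(seeds), set()
    # Longest names first so the alternation prefers full entity names.
    entity_re = "|".join(sorted(map(re.escape, entities), key=len, reverse=True))
    for _ in range(rounds):
        # Step 1: learn patterns, i.e. the text between a known pair.
        for head, tail in pairs:
            for s in sentences:
                m = re.search(re.escape(head) + r"(.+?)" + re.escape(tail), s)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply every pattern to find new pairs.
        for pattern in patterns:
            regex = re.compile(f"({entity_re}){re.escape(pattern)}({entity_re})")
            for s in sentences:
                for m in regex.finditer(s):
                    pairs.add((m.group(1), m.group(2)))
    return pairs, patterns

sentences = [
    "Trump's nationality is the United States.",
    "Merkel's nationality is Germany.",
    "Merkel is a citizen of Germany.",
    "Macron is a citizen of France.",
]
entities = {"Trump", "Merkel", "Macron", "the United States", "Germany", "France"}
pairs, patterns = bootstrap(sentences, {("Trump", "the United States")}, entities)
```

Starting from one seed, the first round learns "'s nationality is " and finds the Merkel pair; the second round learns " is a citizen of " from that pair and finds the Macron pair. The sketch does no pattern scoring, so a single bad pattern would spread errors in later rounds.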

Unsupervised relation extraction

Strictly speaking, unsupervised relation extraction should also include rule-based methods. Here it refers mainly to using clustering algorithms to determine relation types. It generally applies when the relation types are unknown and have to be determined from the corpus. It generally needs a large-scale corpus: the redundancy of the corpus yields a set of candidate relation patterns, and the name of each relation is determined at the end.
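
A toy version of the clustering idea: take the text found between entity pairs and group similar contexts, so that each group suggests one relation type. The greedy single-pass clustering, the Jaccard word-overlap measure, and the threshold below are assumptions chosen for brevity; real systems use much larger corpora and stronger similarity measures.

```python
def jaccard(a, b):
    """Word-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_contexts(contexts, threshold=0.5):
    """Greedy single-pass clustering of context strings by word overlap."""
    clusters = []  # each cluster: {"words": set of words, "members": list of contexts}
    for context in contexts:
        words = set(context.lower().split())
        for cluster in clusters:
            if jaccard(words, cluster["words"]) >= threshold:
                cluster["members"].append(context)
                cluster["words"] |= words
                break
        else:
            clusters.append({"words": words, "members": [context]})
    return clusters

# Hypothetical contexts observed between entity pairs in a corpus.
contexts = [
    "was born in",
    "was born in the city of",
    "works for",
    "currently works for",
    "is employed by",
]
clusters = cluster_contexts(contexts)
for cluster in clusters:
    print(cluster["members"])
```

Note that "is employed by" ends up in its own cluster even though it means the same as "works for"; word overlap alone cannot see that, which is one reason the final naming and merging of relations usually needs a further step.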

Distant supervision relation extraction

Distant supervision (also translated as remote supervision) mainly uses an existing knowledge base together with a document collection. It assumes that if a sentence contains two entities and the knowledge base records a relation between them, then the sentence expresses that relation. There is a Chinese person-relation data set built in this way, by combining person relations from an encyclopedia with sentences from the news.
This is a strong assumption and produces a lot of wrongly labeled data, which inevitably hurts the machine learning algorithms trained on it. A lot of research has therefore focused on reducing the errors introduced by distant supervision.
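
The labeling assumption, and the noise it creates, can be shown in a few lines. The knowledge base triple and the sentences are made up for the example, and `distant_label` is a hypothetical helper name.

```python
def distant_label(sentences, kb_triples):
    """Label every sentence that mentions both entities of a KB triple with its relation."""
    labeled = []
    for sentence in sentences:
        for head, relation, tail in kb_triples:
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled

kb_triples = [("Trump", "nationality", "United States")]
sentences = [
    "Trump's nationality is the United States.",        # correctly labeled
    "Trump flew back to the United States on Monday.",  # wrongly labeled: no nationality stated
    "Merkel visited Paris.",                            # no KB pair, so not labeled
]
labeled = distant_label(sentences, kb_triples)
```

Both of the first two sentences receive the label "nationality", although only the first one expresses it. That second instance is exactly the kind of noise the research mentioned above tries to reduce.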

Open domain relation extraction

Open-domain relation extraction is in fact defined mainly by its task. The techniques are still the ones described above; what is new is the problem being solved, namely extraction in the open domain.

4. Attribute triple extraction

Entity-attribute name-attribute value extraction can either be converted directly into entity-relation-entity triple extraction, or use rules to extract attribute values and entity-attribute name-attribute value triples directly. There is a lot of work on the joint extraction of entities and relations, and attribute values and entity-attribute name-attribute value triples are well suited to joint extraction; after all, extracting attribute values on their own is of little general use.

5. Joint extraction of entities and relationships

Joint entity-relation extraction uses one model that extracts entities and relations at the same time, instead of a pipeline that extracts entities first and relations afterwards. The advantage is that it avoids error propagation and lets the entity information be used in the relation extraction task. It is also more end-to-end: sentences go in, relation triples come out.

1. You can unify relations and entities directly into one task. For example, with a sequence labeling model and the sentence "Trump is the President of the United States, he lives in Washington", "Trump" is labeled "E1-Position", "President of the United States" is labeled "E2-Position", "he" is labeled "E1-Address", and "Washington" is labeled "E2-Address", so the two relation triples can be read off the labels. Any sequence labeling model can be used for this task. The approach has one problem, though: it is hard to handle the case where the same entity takes part in multiple relations.
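
Reading triples off such a tag sequence can be sketched as below. The tokenization and the per-token tags are written by hand to follow the example sentence; in practice they would be the output of a sequence labeling model. The decoding function `decode_triples` is a hypothetical helper that assumes at most one E1 and one E2 span per relation.

```python
from itertools import groupby

def decode_triples(tokens, tags):
    """Merge runs of identical tags into spans, then pair E1 with E2 per relation."""
    roles = {}  # relation -> {"E1": text, "E2": text}
    position = 0
    for tag, run in groupby(tags):
        length = len(list(run))
        if tag != "O":
            role, relation = tag.split("-", 1)
            roles.setdefault(relation, {})[role] = " ".join(tokens[position:position + length])
        position += length
    return [(r["E1"], rel, r["E2"]) for rel, r in roles.items() if "E1" in r and "E2" in r]

tokens = ["Trump", "is", "the", "President", "of", "the", "United", "States",
          ",", "he", "lives", "in", "Washington"]
tags = ["E1-Position", "O", "O",
        "E2-Position", "E2-Position", "E2-Position", "E2-Position", "E2-Position",
        "O", "E1-Address", "O", "O", "E2-Address"]
triples = decode_triples(tokens, tags)
# [("Trump", "Position", "President of the United States"), ("he", "Address", "Washington")]
```

The limitation is visible in the data: each token carries exactly one tag, so "Trump" cannot be the head of both relations, and the second triple has to hang on the pronoun "he".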

2. You can use multi-task learning, with one network learning entity extraction and relation extraction at the same time, as in BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction.
It learns the relation extraction and entity extraction models simultaneously to achieve joint extraction. To avoid the earlier problem of one entity taking part in multiple relations, each tail entity here learns an index that points to its head entity. In this way, the same head entity can take part in multiple relations.
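
The "index that points to the head entity" can be pictured as a small lookup structure. The sketch below only shows the decoding side, with hand-written values standing in for model predictions; the dictionary layout and the name `decode_multi_head` are assumptions for illustration and are not taken from the paper's code.

```python
def decode_multi_head(entities, selections):
    """Turn per-tail (head index, relation) predictions into triples.

    entities:   {token index of an entity's last token: entity text}
    selections: {tail index: [(head index, relation), ...]}
    """
    triples = []
    for tail_index, heads in selections.items():
        for head_index, relation in heads:
            triples.append((entities[head_index], relation, entities[tail_index]))
    return triples

# Hand-written stand-ins for model output on
# "Trump is the President of the United States, he lives in Washington".
entities = {0: "Trump", 7: "President of the United States", 12: "Washington"}
selections = {
    7: [(0, "Position")],   # tail at token 7 points to head at token 0
    12: [(0, "Address")],   # a second tail points to the same head
}
triples = decode_multi_head(entities, selections)
# [("Trump", "Position", "President of the United States"), ("Trump", "Address", "Washington")]
```

Because every tail keeps its own list of head pointers, the head "Trump" appears in both triples, which the single-tag scheme could not express.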

The distinction between attributes and relations. Take two triples about Trump, "Trump-Gender-Male" and "Trump-Country-U.S.": each could be classed either as an attribute triple or as a relation triple. Should "gender" and "country" be regarded as attributes or as relations? Generally speaking, attributes represent the internal characteristics of an entity, and relations represent its external connections. From an operational point of view, the graph ends up stored in a knowledge base, so ask: will we look up Trump through "male"? Will we look up Trump through the United States?
If we access the information in the triple mainly through the head entity, it should be classed as an attribute.
If we access the information in the triple through both the head entity and the tail entity, it should be classed as a relation.
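
This access-path criterion maps directly onto how a property graph might store the two kinds of triple. The class below is a hypothetical in-memory sketch, not the API of any graph database: attributes live on the entity, while relations are indexed from both ends.

```python
from collections import defaultdict

class PropertyGraph:
    def __init__(self):
        self.attributes = defaultdict(dict)  # entity -> {attribute name: value}
        self.out_edges = defaultdict(list)   # head -> [(relation, tail)]
        self.in_edges = defaultdict(list)    # tail -> [(relation, head)]

    def add_attribute(self, entity, name, value):
        # Stored on the entity only: reachable through the entity, not the value.
        self.attributes[entity][name] = value

    def add_relation(self, head, relation, tail):
        # Indexed in both directions: reachable from head and from tail.
        self.out_edges[head].append((relation, tail))
        self.in_edges[tail].append((relation, head))

graph = PropertyGraph()
graph.add_attribute("Trump", "gender", "male")
graph.add_relation("Trump", "country", "United States")

print(graph.attributes["Trump"])        # found via the entity
print(graph.in_edges["United States"])  # Trump found via the tail entity
print("male" in graph.in_edges)         # "male" is a value, not a node
```

Under this layout, "gender" behaves as an attribute because nobody starts a lookup from "male", while "country" behaves as a relation because starting from "United States" is a plausible query.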

Data sets
SemEval-2010 Task 8 English data set (available on the official website, or by contact email)
Chinese name data set (available on github, or by contact email)
2019 Language and Intelligence Technology Competition data (available by contact email)

Origin blog.csdn.net/lovoslbdy/article/details/98847655