General process of building a knowledge graph

  • Data acquisition
    Knowledge modeling is generally required before data acquisition, to establish the data model (schema) of the knowledge graph. Two methods can be used: one is a top-down method, in which experts manually edit the schema with the help of open-source structured data; the other is a bottom-up approach, which converts or maps existing high-quality industry data sources according to existing industry standards. The modeling step is very important, because a standardized schema can effectively reduce the cost of connecting domain data.
    Data types: 1. structured data, such as relational databases; 2. unstructured data, such as pictures, audio, and video; 3. semi-structured data, such as XML and JSON.
    Sources of encyclopedia data: CN-DBpedia (encyclopedia data) + crawlers (a distributed crawler built on scrapy + redis can be used; a minimal spider sketch follows this section).
    Information extraction: word segmentation, word filtering, word vectors, named entity recognition, relation extraction, entity disambiguation.
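As a rough illustration of the crawler side of data acquisition, here is a minimal Scrapy spider sketch. The start URL, item fields, and CSS selectors are hypothetical placeholders; for the distributed scrapy + redis setup mentioned above, the base class would typically be swapped for scrapy-redis's RedisSpider.

```python
# Minimal Scrapy spider sketch for acquiring encyclopedia-style pages.
# The start URL and CSS selectors below are illustrative placeholders.
import scrapy


class EncyclopediaSpider(scrapy.Spider):
    name = "encyclopedia"
    start_urls = ["https://example.org/wiki/Knowledge_graph"]  # placeholder URL

    def parse(self, response):
        # Yield one item per page: title plus raw paragraph text for later extraction
        yield {
            "title": response.css("h1::text").get(),
            "paragraphs": response.css("p::text").getall(),
            "url": response.url,
        }
        # Follow in-page links so the crawl expands to related entries
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Assuming the file is saved as `spider.py`, it can be run with `scrapy runspider spider.py -o pages.json`; the raw pages then feed the information-extraction steps described later.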
  • Data storage
    • The design of the knowledge graph (as with database design): the future development of the company's business must be considered -> the graph is stored in a graph database
      • Need to determine the entities, relationships, attributes
      • Which attributes can be used as entities? Which entities can be used as attributes?
      • What information does not need to be placed in the knowledge graph
      • Business principle: for example, starting from the user who places a call, pull together all of that user's bank-related information and support risk control for the customer's business transactions, etc.
      • Analysis principle: every entity in the knowledge graph serves relationship analysis. If an entity does not help in analyzing the network, it can be made an attribute or left out of the knowledge graph.
      • Redundancy principle: some nodes (super nodes) in the knowledge graph are linked to most other nodes; links of little significance can be omitted, because they hurt query efficiency. For example, Gender -> Male: every male user would link to this node, so "male" is better kept as a user attribute.
      • Efficiency principle: the knowledge graph stores only key information, and the rest is stored in traditional databases. Put frequently accessed information into the knowledge graph, and infrequently used information into the traditional database.
    • Store in a standardized format such as RDF (Resource Description Framework)
    • Company business tables + an SPO (subject-predicate-object) triple table (see the sketch below)
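To make the SPO representation concrete, here is a minimal sketch using the rdflib library (one possible choice, not prescribed by the original post); the namespace is a placeholder, and the triples reuse the Bill Gates example from the logical-architecture section below.

```python
# Minimal SPO (subject-predicate-object) storage sketch using rdflib.
# The namespace and triples are illustrative placeholders.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/kg/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Entity-relation-entity triple
g.add((EX["Bill_Gates"], EX["wife"], EX["Melinda_Gates"]))
# Entity-attribute-value triple ("male" kept as an attribute rather than a
# super node, following the redundancy principle above)
g.add((EX["Bill_Gates"], EX["gender"], Literal("male")))

# Serialize to a standard RDF syntax (Turtle) for storage or exchange
print(g.serialize(format="turtle"))

# Simple lookup: everything stored about one subject
for p, o in g.predicate_objects(EX["Bill_Gates"]):
    print(p, o)
```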
  • The structure of the knowledge graph
    • Architecture type: logical architecture
      • The logical architecture is divided into two levels: the data layer and the schema layer
      • Schema layer: sits above the data layer and is the core of the knowledge graph; it stores refined knowledge and is typically managed through an ontology library
      • Data layer: stores the actual data
      • Example: schema layer: entity-relation-entity, entity-attribute-value
        Data layer: Bill Gates-wife-Melinda Gates, Bill Gates-President-Microsoft
    • Architecture Type: Technical Architecture

      • First, the data may be structured, unstructured, or semi-structured -> knowledge is extracted from these data with a series of automated and semi-automated techniques to construct the knowledge graph -> the resulting entities and relationships are stored in the schema layer and data layer of the knowledge base
      • Iterative update of knowledge graph
        • Information extraction: extract entities, attributes, and relationships between entities from various types of data sources, to form an ontological knowledge representation (a combined extraction sketch follows this list)
          1. Entity extraction, i.e. named entity recognition (NER): automatically recognize named entities from text data sets.
          2. Relation extraction: to recover the semantic information connecting the extracted, otherwise isolated named entities, the association relations between entities must also be extracted from the related corpus; linking entities through these relations forms a networked knowledge structure. Related learning content:
            • Manually constructed grammatical and semantic rules (pattern matching);
            • Statistical machine learning methods;
            • Supervised learning methods based on feature vectors or kernel functions;
            • A research focus that has shifted to semi-supervised and unsupervised methods;
            • Emerging work on information extraction for open domains;
            • Combinations of open-domain information extraction methods with traditional methods for closed domains
          3. Attribute extraction: collect attribute information about a specific entity from different information sources. For example, for a public figure, the nickname, birthday, nationality, education, and other information can be obtained from public information on the Internet. Related learning:
            • Treat an entity's attributes as a nominal relation between the entity and the attribute value, turning attribute extraction into a relation extraction task
            • Extract from structured data using rules and heuristic algorithms
            • Automatically build training corpora from the semi-structured data of encyclopedia websites, use them to train an entity-attribute annotation model, and then apply the model to attribute extraction from unstructured data
            • Use data mining methods to mine relations between entity attributes and attribute values directly from text, and on that basis locate attribute names and attribute values in the text
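A minimal sketch of the extraction steps above, assuming spaCy and its small English model are installed (`pip install spacy` and `python -m spacy download en_core_web_sm`). The NER part uses spaCy directly; the relation-extraction part is only a naive co-occurrence heuristic standing in for the pattern-matching and statistical methods listed above.

```python
# Minimal information-extraction sketch: NER plus a naive co-occurrence
# heuristic that proposes candidate relations between entity pairs.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
from itertools import combinations

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bill Gates founded Microsoft in Albuquerque.")

# 1. Entity extraction (NER): spaCy labels spans such as PERSON, ORG, GPE
print([(ent.text, ent.label_) for ent in doc.ents])

# 2. Relation extraction (crude placeholder): for each pair of entities in
#    a sentence, use the verbs between them as a candidate relation label
for sent in doc.sents:
    for e1, e2 in combinations(sent.ents, 2):
        verbs = [t.lemma_ for t in sent
                 if t.pos_ == "VERB" and e1.end <= t.i < e2.start]
        if verbs:
            print((e1.text, verbs[0], e2.text))  # e.g. ('Bill Gates', 'found', 'Microsoft')
```

Attribute extraction can be prototyped the same way by treating the attribute value as the second element of the pair, as the list above notes.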
        • Knowledge fusion: after acquiring new knowledge, it must be integrated to eliminate contradictions and ambiguities; for example, some entities may have multiple expressions. Main work: entity linking, knowledge merging
          1. The relations among freshly extracted pieces of information are flat, lacking hierarchy and logic, and the knowledge still contains a lot of redundant and erroneous information.
          2. Entity linking: the operation of linking an entity mention extracted from text to the corresponding, correct entity object in the knowledge base. The basic idea is to select a group of candidate entity objects from the knowledge base for a given entity mention, and then link the mention to the correct entity object through similarity calculation (a minimal sketch follows this list). Related information:
            • It is necessary not only to extract entities from the text and link them to the knowledge base, but also to consider the semantic connections between entities within the same document
            • A focus on using the co-occurrence relations of entities and linking multiple entities to the knowledge base at the same time, i.e. collective entity linking. Entity linking process:
              • Extract entity mentions from the text through entity extraction;
              • Carry out entity disambiguation and coreference resolution, to determine whether a same-named entity in the knowledge base represents different meanings and whether other named entities in the knowledge base express the same meaning;
              • After confirming the correct entity object in the knowledge base, link the entity mention to the corresponding entity in the knowledge base.
              • Entity disambiguation is a technique specifically for resolving the ambiguity caused by same-named entities; with it, entity links can be established accurately according to the current context. Entity disambiguation mainly uses clustering; it can also be viewed as a context-based classification problem, similar to part-of-speech and word-sense disambiguation.
              • Coreference resolution is mainly used to solve the problem of multiple mentions corresponding to the same entity object. In a piece of text, several mentions may refer to the same entity; coreference resolution associates (merges) these mentions with the correct entity object. The problem is particularly important in fields such as information retrieval and natural language processing. Coreference resolution also goes by other names: object alignment, entity matching, and entity synonymy.
          3. Knowledge merging: the entity linking introduced above handles data extracted from semi-structured and unstructured sources through information extraction; there is another kind of data source: structured data (such as external knowledge bases and relational databases). There are two types of knowledge merging:
            • Merging external knowledge bases, which requires handling conflicts between the data layer and the schema layer
            • Merging relational databases, using RDB2RDF and other methods
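A minimal entity-linking sketch following the "candidate generation + similarity calculation" idea described above. The tiny in-memory knowledge base and the bag-of-words context similarity are stand-ins for a real knowledge base and a real disambiguation model.

```python
# Entity-linking sketch: generate candidates by name, then pick the
# candidate whose description best matches the mention's context.
# The toy knowledge base below is purely illustrative.
KB = {
    "Q1": {"name": "Apple", "description": "technology company founded by Steve Jobs"},
    "Q2": {"name": "Apple", "description": "sweet edible fruit of the apple tree"},
}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def link(mention, context):
    # 1. Candidate generation: every KB entity whose name matches the mention
    candidates = [(qid, e) for qid, e in KB.items()
                  if e["name"].lower() == mention.lower()]
    # 2. Disambiguation: similarity between the mention's context and each description
    ctx = set(context.lower().split())
    best_qid, _ = max(candidates,
                      key=lambda c: jaccard(ctx, set(c[1]["description"].lower().split())))
    return best_qid


print(link("Apple", "the technology company released a new phone"))  # -> Q1
print(link("Apple", "he ate an apple and a pear"))                   # -> Q2
```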
        • Knowledge processing: newly fused knowledge must undergo quality assessment (part of which requires manual work) before the qualified portion can be added to the knowledge base. Knowledge processing mainly includes three parts: ontology construction, knowledge reasoning, and quality assessment
          1. Ontology construction

            • An ontology is a collection of concepts and a conceptual framework, covering notions such as "person", "event", and "thing"
            • An ontology can be constructed manually through hand editing (with the help of ontology-editing software), or in a data-driven, automated way. Because the manual workload is huge and it is hard to find experts who meet the requirements, current mainstream global ontology libraries start from existing ontology libraries for specific fields and are gradually expanded using automatic construction techniques.
            • The automated ontology construction process includes three stages: 1. entity parallel (sibling) relation similarity calculation; 2. entity hypernym-hyponym (upper-lower) relation extraction; 3. ontology generation (a rough sketch follows the example below).
              For example, in the first step, when the knowledge graph has just obtained the three entities "Alibaba", "Tencent", and "mobile phone", it may treat them as indistinguishable; after calculating the similarity between the three entities, it finds that Alibaba and Tencent are more similar to each other and quite different from "mobile phone". In the second step, the knowledge graph still has no notion of upper and lower levels; it does not yet know that Alibaba and a mobile phone are not of the same type at all and cannot be compared, so this is resolved by extracting the hypernym-hyponym relations of the entities. On that basis, the third step generates the ontology.
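The three stages above, caricatured in a few lines of Python: similarity from context words, a Hearst-style "X such as Y" pattern for the upper-lower relation, and a tiny generated hierarchy. The context strings and example sentence are made up for illustration; real systems use far richer features.

```python
# Ontology-construction sketch: (1) sibling-relation similarity from context
# words, (2) hypernym extraction via an "X such as Y" pattern, (3) a tiny
# generated hierarchy. All data below is illustrative.
import re

contexts = {
    "Alibaba": "internet company e-commerce cloud business china",
    "Tencent": "internet company games social business china",
    "mobile phone": "device hardware screen battery consumer electronics",
}
words = {e: set(c.split()) for e, c in contexts.items()}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


# Stage 1: parallel (sibling) relation similarity between entities
print(jaccard(words["Alibaba"], words["Tencent"]))       # 0.5  (similar)
print(jaccard(words["Alibaba"], words["mobile phone"]))  # 0.0  (unrelated)

# Stage 2: upper-lower (hypernym) relation extraction with a Hearst-style pattern
sentence = "Internet companies such as Alibaba and Tencent dominate the market."
m = re.search(r"(\w[\w ]*?) such as ([\w ]+?) and ([\w ]+?)\b", sentence)
hypernym, h1, h2 = m.group(1), m.group(2), m.group(3)

# Stage 3: ontology generation: attach the siblings under their common parent
print({hypernym: [h1, h2]})  # {'Internet companies': ['Alibaba', 'Tencent']}
```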
          2. Knowledge reasoning

            • Ontology construction only builds the skeleton of the knowledge graph; most relationships in the graph are still incomplete and missing values are very common, so knowledge reasoning techniques must be used to complete it.
            • Note: the object of knowledge reasoning is not limited to relations between entities; it can also be an entity's attribute values, the concept hierarchy of the ontology, and so on.
              Examples:
              Inferring an attribute value: knowing an entity's birthday attribute, the entity's age attribute can be obtained by reasoning;
              Inferring a concept: knowing (tiger, family, Felidae) and (Felidae, order, Carnivora), one can infer (tiger, order, Carnivora) (see the sketch below)
            • The main classes of algorithms: logic-based reasoning, graph-based reasoning, and deep-learning-based reasoning
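Both examples above (deriving age from a birthday and lifting the tiger from its family to its order) are simple rule applications. A minimal rule-based sketch over a toy triple set:

```python
# Knowledge-reasoning sketch: two hand-written rules applied to a toy
# triple store, mirroring the two examples above. Data is illustrative.
from datetime import date

triples = {
    ("tiger", "family", "Felidae"),
    ("Felidae", "order", "Carnivora"),
    ("Bill_Gates", "birthday", "1955-10-28"),
}
inferred = set()

# Rule 1: (X, family, F) and (F, order, O)  =>  (X, order, O)
for x, r1, f in triples:
    for f2, r2, o in triples:
        if r1 == "family" and r2 == "order" and f == f2:
            inferred.add((x, "order", o))

# Rule 2: (X, birthday, D)  =>  (X, age, years elapsed since D)
for x, r, d in triples:
    if r == "birthday":
        born, today = date.fromisoformat(d), date.today()
        age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
        inferred.add((x, "age", age))

print(inferred)  # {('tiger', 'order', 'Carnivora'), ('Bill_Gates', 'age', ...)}
```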
          3. Quality assessment

            • Quality assessment is also an important part of knowledge base construction. Its significance is that the credibility of knowledge can be quantified, and the quality of the knowledge base can be guaranteed by discarding knowledge with low confidence (a minimal filtering sketch follows).
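The simplest form of this step is a confidence threshold over candidate facts; the triples and scores below are made up for illustration.

```python
# Quality-assessment sketch: keep only candidate triples whose confidence
# exceeds a threshold. Scores here are illustrative placeholders.
CONFIDENCE_THRESHOLD = 0.7

candidates = [
    (("Bill_Gates", "wife", "Melinda_Gates"), 0.95),
    (("Bill_Gates", "wife", "Microsoft"), 0.12),   # spurious extraction, low score
    (("tiger", "order", "Carnivora"), 0.81),
]

accepted = [t for t, conf in candidates if conf >= CONFIDENCE_THRESHOLD]
rejected = [t for t, conf in candidates if conf < CONFIDENCE_THRESHOLD]

print("added to the knowledge base:", accepted)
print("discarded (low confidence):", rejected)
```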
        • Knowledge update
          • The update of the knowledge base includes updating the concept layer and updating the data layer.
            Updating the concept layer means that new concepts emerge after new data are added, and these new concepts need to be added automatically to the concept layer of the knowledge base.
            Updating the data layer mainly means adding or updating entities, relations, and attribute values; it must consider the reliability of the data sources and the consistency of the data (whether there are contradictions, redundancy, etc.), and should prefer facts and attributes that appear frequently across reliable data sources when adding to the knowledge base.
          • There are two ways to update the content of the knowledge graph:
            Full update: takes all the data, old and new, as input and builds the knowledge graph from scratch. This method is relatively simple, but it consumes a lot of resources and requires a lot of human effort to maintain the system;
            Incremental update: takes only the new data as input and adds new knowledge to the existing knowledge graph. This method consumes fewer resources, but still requires considerable manual intervention (defining rules, etc.), so it is quite difficult to implement (a minimal incremental-merge sketch follows).
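A minimal sketch of the incremental path: merge new facts into the existing triple set, skipping duplicates and flagging contradictions on single-valued relations for manual review. The relation list and data are assumptions made for illustration.

```python
# Incremental-update sketch: merge new triples into an existing store,
# skipping duplicates and flagging contradictions on single-valued
# relations. The rule set and data are illustrative.
SINGLE_VALUED = {"birthday", "gender"}  # relations allowing one value per entity

existing = {
    ("Bill_Gates", "birthday", "1955-10-28"),
    ("Bill_Gates", "wife", "Melinda_Gates"),
}

new_facts = [
    ("Bill_Gates", "employer", "Microsoft"),    # genuinely new -> add
    ("Bill_Gates", "wife", "Melinda_Gates"),    # duplicate -> skip
    ("Bill_Gates", "birthday", "1956-01-01"),   # contradiction -> manual review
]

for s, p, o in new_facts:
    if (s, p, o) in existing:
        continue  # already known, nothing to do
    if p in SINGLE_VALUED and any(es == s and ep == p for es, ep, _ in existing):
        print("conflict, needs manual review:", (s, p, o))
        continue
    existing.add((s, p, o))

print(existing)
```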
  • Application of Knowledge Graph
    • Algorithm analysis of graph mining
    • Derive various phenomena/application scenarios from complex networks
      • Intelligent search: this is also the most mature scenario for knowledge graphs; search results and related people are returned automatically, and a person-relationship graph can be constructed to view the data from more dimensions
      • Anti-fraud: there are two main reasons knowledge graphs fit here: anti-fraud data sources are diverse (both structured and unstructured), and many fraud cases involve complex networks of relationships.
      • Inconsistency verification (similar to cross-validation)-relational reasoning
      • Anomaly analysis (computationally heavy, generally run offline; a minimal static-analysis sketch follows this list)
        • Static analysis: given a graph structure and a point in time, find abnormal nodes (such as abnormal subgraphs) in it.
        • Dynamic analysis: analyze how the structure changes over time. (Assuming the structure of the knowledge graph does not change much over a short period, a large change suggests a possible anomaly that needs further attention; this involves time-series analysis and graph-similarity computation.)
      • Lost-contact customer management: dig out new contact information through the relationship network and improve the success rate of collections.
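For the static anomaly analysis mentioned above, a minimal sketch with networkx: build a relationship graph and flag nodes whose degree is far above average (potential super nodes or suspicious hubs). The edge list is a made-up placeholder.

```python
# Static anomaly-analysis sketch: flag unusually well-connected nodes in a
# relationship graph. The edge list is an illustrative placeholder.
import networkx as nx

edges = [
    ("user_1", "user_2"), ("user_1", "user_3"), ("user_1", "user_4"),
    ("user_1", "user_5"), ("user_1", "user_6"),  # user_1 is a hub
    ("user_2", "user_3"), ("user_4", "user_5"),
]

G = nx.Graph()
G.add_edges_from(edges)

degrees = dict(G.degree())
mean_degree = sum(degrees.values()) / len(degrees)

# Flag nodes with degree more than twice the average as candidates for
# further (possibly manual) inspection.
suspicious = [n for n, d in degrees.items() if d > 2 * mean_degree]
print(suspicious)  # ['user_1']
```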

Origin blog.csdn.net/weixin_46046193/article/details/108632485