【Review of Knowledge Graphs】Knowledge Graphs: A Survey

Knowledge Graph Overview

This article is a summary based on reading Knowledge Graphs (ACM Comput. Surv., 54(4): 1–37, 2021). It covers the underlying principles only at a fairly shallow level and aims to support a more comprehensive understanding of knowledge graphs. If there are any mistakes, please feel free to point them out.

1. Intro

The origin of the modern "knowledge graph" concept: 2012 announcement of the Google Knowledge Graph

Related references:
Knowledge graph related examples https://github.com/knowledge-graphs-tutorial/examples .

Terminology

  • Knowledge graph: a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent potentially different relations between these entities. The data graph conforms to a graph-based data model (directed edge-labeled graph, heterogeneous graph, property graph, etc.).
  • Knowledge: “explicit knowledge”, something that is known and can be written down.
  • Open vs. enterprise knowledge graphs: open knowledge graphs are published for use by a broad community or the public, while enterprise knowledge graphs are internal to an organization.

2. Data Graphs

Modeling data as a graph is the basis for constructing any knowledge graph.

Models

The most commonly used graph data models are:

  • Directed edge-labeled graph (del graph, or multi-relational graph): a set of nodes and directed labeled edges between them. Nodes represent entities, and edges represent binary relations between entities.
    Features: Flexible, no need to organize data hierarchically like a tree, and can also represent loops

  • Heterogeneous graph: Each node and edge is assigned a type.

  • Homogeneous edge: an edge between two nodes of the same type; otherwise it is a heterogeneous edge. Typing allows nodes to be partitioned by type, which is useful for machine learning tasks. But unlike in a del graph, each node is assigned exactly one type.

  • Property graph (attribute graph): nodes and edges can be associated with labels and attribute–value pairs, making modeling and modification more flexible than in a del graph (a small code sketch after this list contrasts the two representations).

  • Graph dataset: manages multiple graphs, consisting of a set of named graphs and a default graph.
    The default graph has no ID and manages the metadata of the named graphs. Graph datasets can be generalized to any type of graph, not just del graphs.
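The models above can be illustrated with plain data structures. Below is a minimal sketch in Python (the attribute values and the "flight" edge props are invented for illustration, not taken from the survey) showing a small graph encoded first as a directed edge-labeled graph (a set of triples) and then as a property graph (nodes and edges carrying labels plus attribute maps).

```python
# A directed edge-labeled (del) graph: simply a set of (source, edge label, target) triples.
del_graph = {
    ("Santiago", "capital_of", "Chile"),
    ("Santiago", "flight", "Arica"),
    ("Arica", "flight", "Santiago"),
}

# The same data as a property graph: nodes and edges carry labels plus attribute-value maps,
# so an edge can hold its own key-value pairs (the attribute values here are invented).
property_graph = {
    "nodes": {
        "Santiago": {"labels": ["City"], "props": {"country": "Chile"}},
        "Arica":    {"labels": ["City"], "props": {}},
        "Chile":    {"labels": ["Country"], "props": {}},
    },
    "edges": [
        {"src": "Santiago", "dst": "Chile", "label": "capital_of", "props": {}},
        {"src": "Santiago", "dst": "Arica", "label": "flight", "props": {"company": "AirExample"}},
        {"src": "Arica", "dst": "Santiago", "label": "flight", "props": {"company": "AirExample"}},
    ],
}

# In the del graph, per-edge attributes would need extra nodes (reification);
# in the property graph they sit directly on the edge.
print(("Santiago", "flight", "Arica") in del_graph)   # True
```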

Querying

Languages for querying graphs include SPARQL for del graphs and Cypher, Gremlin, and G-CORE for property graphs, among others.

Graph patterns: graph structures used for querying that contain both constants and variables; the variables are what get matched against the data. In the figure (omitted), the left side shows a graph pattern and the right side shows the mappings of that pattern onto the data graph. That is, evaluating a graph pattern over the data graph produces a table of mappings.

In the last three rows of the results table, vn1 and vn2 are mapped to the same term, which may not be the kind of result we want. Several semantics for evaluating graph patterns therefore exist; the two most important are homomorphism-based semantics, which allow different variables to map to the same term, so all mappings shown on the right are returned (SPARQL adopts this semantics), and isomorphism-based semantics, under which different variables cannot be mapped to the same term.

In addition, there are complex graph patterns that support relational operators like those of SQL or graph query languages, as well as navigational graph patterns for regular path queries, etc.

In short, the task is to design a graph pattern so that evaluating it over the data graph returns the table of mappings we need.
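As a rough sketch of what evaluating a graph pattern means, the following Python snippet (a toy matcher over a triple representation; the entity names and the two-triple pattern are illustrative choices, not the survey's figure) enumerates all homomorphism-based mappings of a pattern and then filters them down to isomorphism-based ones, where distinct variables must map to distinct terms.

```python
from itertools import product

# Toy data graph (triples) and a graph pattern; terms starting with "?" are variables.
data = {
    ("EID15", "venue", "Santa Lucia"),
    ("EID16", "venue", "Piedras Rojas"),
    ("EID17", "venue", "Santa Lucia"),
    ("Santa Lucia", "city", "Santiago"),
}
pattern = [("?ev1", "venue", "?place"), ("?ev2", "venue", "?place")]

def evaluate(pattern, data):
    """Return every homomorphism-based mapping of the pattern onto the data."""
    mappings = []
    for triples in product(data, repeat=len(pattern)):
        binding, ok = {}, True
        for (ps, pl, po), (ds, dl, do) in zip(pattern, triples):
            for p_term, d_term in ((ps, ds), (pl, dl), (po, do)):
                if p_term.startswith("?"):
                    # Variable: bind it, or check that the existing binding agrees.
                    if binding.setdefault(p_term, d_term) != d_term:
                        ok = False
                elif p_term != d_term:      # constant must match exactly
                    ok = False
        if ok:
            mappings.append(binding)
    return mappings

homomorphic = evaluate(pattern, data)
# Isomorphism-based semantics additionally forbids two variables sharing one term.
isomorphic = [m for m in homomorphic if len(set(m.values())) == len(m)]
print(len(homomorphic), len(isomorphic))   # more homomorphic mappings than isomorphic ones
```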

Validation

While graphs offer flexible representations for diverse, incomplete data at large scale, we may want to validate that our data graph follows a particular structure or is "complete" in some sense. For example, we might want to ensure that every event has at least a name, a location, and start and end dates. One mechanism for such validation is to use shapes graphs.

A shapes graph consists of a set of interrelated shapes, each of which targets a set of nodes in the data graph and specifies constraints on those nodes, similar to a UML class diagram.
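As a rough analogue of validating a data graph against a shape, the sketch below checks that every node targeted as an Event has at least a name, location, start, and end. This is a hand-rolled Python illustration, not SHACL or ShEx syntax, and the property names and second event are assumptions for the example.

```python
# Data graph as (subject, predicate, object) triples.
data = {
    ("EID15", "type", "Event"),
    ("EID15", "name", "Ñam"),
    ("EID15", "location", "Santa Lucia"),
    ("EID15", "start", "2020-03-22"),
    ("EID15", "end", "2020-03-29"),
    ("EID16", "type", "Event"),
    ("EID16", "name", "Food Truck Fest"),   # missing location/start/end -> should fail
}

# A "shape": the target class and the properties each target node must have (min count 1).
event_shape = {"target_class": "Event",
               "required": ["name", "location", "start", "end"]}

def validate(data, shape):
    targets = {s for (s, p, o) in data if p == "type" and o == shape["target_class"]}
    report = {}
    for node in targets:
        props = {p for (s, p, o) in data if s == node}
        report[node] = [p for p in shape["required"] if p not in props]  # [] means conforming
    return report

print(validate(data, event_shape))
# e.g. {'EID15': [], 'EID16': ['location', 'start', 'end']}
```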

Context

Context contains implicit information about the data, such as when and where it occurred, allowing the data to be interpreted from different perspectives.

Context information can be represented in several ways: the context can itself be added directly as a data node, edges can be annotated with context, and so on. In the figure (omitted), e is used as an edge rather than as a data node in the direct representation.

3. Deductive Knowledge

This part involves many specifications and standards, not all of which arise in practical application scenarios; they are only briefly summarized here.

We can infer more knowledge from the data graph. For example, from the first graph we can infer that the music festival is located in the city of Santiago, and so on. Given data as premises, together with some a priori rules about the world that we might know, we can use a deductive process to derive new data.

By giving machines a formal notion of logical consequence, automated reasoning can be achieved. Although many logical frameworks could be used for this purpose, such as First-Order Logic, Datalog, Prolog, Answer Set Programming, etc., here we focus on ontologies, which can be regarded as knowledge graphs with a well-defined meaning.
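A minimal sketch of such deductive derivation: one hand-written rule ("if an event's venue is in a city, the event is located in that city" — an assumption made up for this illustration, not a rule stated in the survey) is applied to the data graph by naive forward chaining until no new triples appear.

```python
data = {
    ("EID15", "venue", "Santa Lucia"),
    ("Santa Lucia", "city", "Santiago"),
}

# Rule: (?e venue ?v) AND (?v city ?c)  =>  (?e located_in ?c)
def apply_rule(triples):
    derived = set()
    for (e, p1, v) in triples:
        if p1 != "venue":
            continue
        for (v2, p2, c) in triples:
            if p2 == "city" and v2 == v:
                derived.add((e, "located_in", c))
    return derived

# Naive forward chaining: keep applying the rule until a fixed point is reached.
inferred = set(data)
while True:
    new = apply_rule(inferred) - inferred
    if not new:
        break
    inferred |= new

print(inferred - data)   # {('EID15', 'located_in', 'Santiago')}
```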

Ontologies

An ontology is a concrete, formal representation of what the terms used in a knowledge graph mean. The most popular ontology language is the Web Ontology Language (OWL).

Under an interpretation, both the nodes and the edges of the data graph are mapped to the real-world things they denote; the graph they are mapped to is called the domain graph, which should be consistent with the structure of the original data graph.

For interpretations there are also several standard assumptions, such as the Closed World Assumption (CWA) vs. the Open World Assumption (OWA), and the Unique Name Assumption (UNA) vs. No Unique Name Assumption (NUNA) (under NUNA, for example, the same entity may be referred to by different names).

In addition to these assumptions, we can also define patterns over the data graph, together with the interpretations that satisfy them. For example, specific patterns can be defined for individuals (entities), properties, and even classes, so that which interpretations are consistent with the data can be determined from these patterns.

Reasoning

The OWL standard also defines inference rules and offers a variety of ways to implement reasoning. OWL is strongly influenced by Description Logics (DLs), which can be regarded as a predecessor of such ontology languages.

4. INDUCTIVE KNOWLEDGE

Inductive reasoning can generalize patterns from input data, and these patterns can be used to generate new but potentially imprecise predictions.

For example, from a graph containing geographical and flight information, we may observe that almost all capitals of countries have international airports, and therefore predict that, since Santiago is a capital, it probably also has an international airport. However, some capitals (such as Vaduz) do not have an international airport, so such predictions may not be completely accurate. If we see that 187 of the 195 capitals have international airports, then we can assign a confidence of 0.959 to predictions made with this pattern.

We call the knowledge obtained through induction inductive knowledge, including the generalized models and the predictions made by these models.

Supervised and unsupervised learning are often used in induction techniques for knowledge graphs, as shown in the figure (omitted).
In graph analysis, there is a large amount of work using unsupervised methods, such as detecting communities or clusters, finding central nodes and edges in graphs, etc. Graph embedding uses self-supervision to learn low-dimensional numerical models in knowledge graphs. Graph structures can also be used directly for supervised learning through graph neural networks. Symbolic learning can learn symbolic models from graphs in a self-supervised manner, that is, logical formulas in the form of rules or axioms, etc.

Graph Analytics

Graph algorithms are generally used to analyze the topology of graphs, such as how nodes and groups are connected, etc. In this section, we introduce common graph algorithms applied to knowledge graphs, as well as graph processing frameworks that can implement such algorithms.

Graph Algorithms

  • Centrality analysis: identifying the most important (central) nodes and edges of the graph. Node centrality measures include degree, betweenness, closeness, eigenvector, PageRank, HITS, Katz, etc. Betweenness centrality can also be applied to edges, for example to predict the busiest traffic segments by finding the edges that the most shortest routes depend on.

  • Community detection: identifying sub-graphs (communities) that are more densely connected internally than to the rest of the graph. Algorithms include minimum-cut approaches, label propagation, Louvain modularity, etc.

  • Connectivity analysis: estimating how well connected and resilient the graph is, including measuring graph density or k-connectivity, detecting strongly/weakly connected components, and computing spanning trees or minimum cuts.

  • Node similarity: finding nodes that are similar to one another in terms of how they are connected within their neighbourhoods, using measures such as structural equivalence, random walks, and diffusion kernels. These methods help characterise what connects nodes and in what ways they are similar.

  • Graph summarisation: extracting high-level structure from a graph, often in the form of a quotient graph; such approaches help provide an overview of large-scale graphs. An example is shown in the figure (omitted).
    In this example, the quotient nodes are defined in terms of outgoing edge labels; from left to right they represent islands, cities, and towns, respectively.

Many such graph algorithms have been proposed and studied for simple graphs or directed graphs without edge labels. In the context of knowledge graphs, one challenge is how to apply such algorithms to knowledge graph models such as del graphs, heterogeneous graphs, or attribute graphs.

Graph Processing Frameworks

For large-scale processing tasks, many parallel graph frameworks have been proposed, such as Apache Spark (GraphX), GraphLab, Pregel, Signal–Collect, Shark, etc. These frameworks use iterative computation in which each node reads the messages on its incoming edges (and possibly its previous state), performs a computation, and then sends the results along its outgoing edges.

An example of an iterative PageRank computation on such a parallel graph framework is shown in the figure (omitted). Algorithms in this framework consist of a function that computes message values (MSG) and a function that aggregates incoming messages (AGG).
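A minimal sketch, in plain Python, of this MSG/AGG style of computation: each node sends its current rank divided by its out-degree along its outgoing edges (MSG), and each node sums the incoming messages and applies the damping factor (AGG). The toy graph and the damping value 0.85 are illustrative assumptions, not taken from the survey's figure.

```python
# Toy directed graph: node -> list of nodes it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
nodes = list(graph)
d, n = 0.85, len(nodes)

rank = {v: 1.0 / n for v in nodes}           # initial state of every node

for _ in range(20):                           # fixed number of supersteps
    # MSG: each node emits rank/out-degree along each outgoing edge.
    inbox = {v: [] for v in nodes}
    for v, targets in graph.items():
        share = rank[v] / len(targets)
        for t in targets:
            inbox[t].append(share)
    # AGG: each node sums its incoming messages and applies damping.
    rank = {v: (1 - d) / n + d * sum(inbox[v]) for v in nodes}

print({v: round(r, 3) for v, r in rank.items()})
```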

Knowledge Graph Embeddings

Machine learning can be directly used to refine knowledge graphs; or it can be used for downstream tasks of knowledge graphs, such as recommendation, information extraction, question answering, query relaxation, query approximation, etc. However, machine learning techniques often employ numerical representations (e.g., vectors), which are often different from graph representations. So, how do you numerically encode graphs for machine learning?

The initial attempt was to use one-hot encoding: for each node, create an $L \times V$ matrix, where $L$ is the number of edge-label types and $V$ is the number of nodes.

Why not use a node-by-node matrix here? Because, unlike simpler graph models, a knowledge graph may have multiple edges with different labels between the same pair of nodes.

However, such a representation results in a sparse and overly large matrix that is too complex for most machine learning models.

The main goal of knowledge graph embedding technology is to create a dense representation of the graph (i.e., embedding graph) in a continuous low-dimensional (fixed, usually very low) vector space, which can be used for machine learning tasks.

To obtain an embedding model, given a scoring function, we want to maximize the plausibility of positive edges (usually edges present in the graph) and minimize the plausibility of negative examples (usually edges obtained by changing a node or edge label so that the resulting edge is no longer in the graph). The resulting embeddings can then be viewed as a model, learned through self-supervision, that encodes the (latent) features of the graph and maps input edges to plausibility scores.

Embeddings can be used for many downstream tasks. Plausibility scoring functions can be used to assign confidence to edges or predict missing connections. Furthermore, embeddings often assign similar vectors to similar terms and therefore can also be used for similarity measures.

Next we will discuss some of the most commonly used graph embedding techniques.

Translational Models

Translational models interpret edge labels as transformations from head (source) nodes to tail (target) nodes. For example, for the edge San Pedro –bus→ Moon Valley, the bus label is interpreted as transforming San Pedro into Moon Valley.

The pioneering translational approach is TransE. For the positive edge above, TransE learns vectors $\mathbf{e}_S$, $\mathbf{r}_b$, and $\mathbf{e}_M$, and tries to make $\mathbf{e}_S + \mathbf{r}_b$ as close as possible to $\mathbf{e}_M$; for negative edges, it pushes the corresponding vectors as far apart as possible. The figure (omitted) shows TransE's two-dimensional embedding of these relations and entities.
To prevent TransE from assigning similar vectors to different relation labels, many follow-up techniques improve on it using hyperplanes or different vector spaces.
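A minimal numpy sketch of the TransE idea, using random toy embeddings: the plausibility of an edge s –p→ o is higher when $\|\mathbf{e}_s + \mathbf{r}_p - \mathbf{e}_o\|$ is small, and a margin-based loss pushes positive edges to score better than corrupted (negative) ones. This illustrates only the scoring and objective, not the original implementation or its training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # embedding dimension
entities = {e: rng.normal(size=d) for e in ["San Pedro", "Moon Valley", "Arica"]}
relations = {"bus": rng.normal(size=d)}

def score(s, p, o):
    """TransE distance: smaller means more plausible."""
    return np.linalg.norm(entities[s] + relations[p] - entities[o])

def margin_loss(pos, neg, margin=1.0):
    """Hinge loss: the positive edge should score at least `margin` better than the negative."""
    return max(0.0, margin + score(*pos) - score(*neg))

pos = ("San Pedro", "bus", "Moon Valley")     # an edge in the graph
neg = ("San Pedro", "bus", "Arica")           # a corrupted (negative) edge
print(score(*pos), score(*neg), margin_loss(pos, neg))
```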

Tensor Decomposition Models

A tensor is a multidimensional array of numerical values that generalizes scalars (order-0 tensors), vectors (order-1 tensors), and matrices (order-2 tensors) to arbitrary order. Tensor decomposition breaks a tensor down into lower-order tensors from which the original tensor can be recomposed (or approximated) by a fixed sequence of basic operations; these component tensors can be thought of as capturing latent factors of the original. There are many approaches to tensor decomposition; here we briefly introduce the main idea.

Consider an $a \times b$ matrix $C$ in which each element $C_{ij}$ gives the average temperature of the $i$-th city in the $j$-th month. We can try to decompose $C$ into a vector $x$ of length $a$ (a city factor, e.g., reflecting each city's latitude) and a vector $y$ of length $b$ (a month factor, e.g., lower temperatures in autumn and winter). If $C = x \otimes y$ (the outer product of the two vectors), then $C$ has rank 1. Otherwise, the rank $r$ of $C$ is the minimum number of such vector products needed to represent it, i.e., $C = x_1 \otimes y_1 + \cdots + x_r \otimes y_r$; the later products may correspond to corrections for factors such as a city's altitude. A low-rank decomposition fixes a limit $d < r$ and computes the $d$ vector products whose sum best approximates $C$, giving its best rank-$d$ approximation. Extending this idea to tensors gives the Canonical Polyadic (CP) decomposition.
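A small numpy sketch of the low-rank idea, using invented temperature numbers: the best rank-$d$ approximation of the city-by-month matrix $C$ can be obtained from its singular value decomposition as a sum of $d$ outer products, and the approximation error shrinks as $d$ grows.

```python
import numpy as np

# Invented 4-city x 6-month temperature matrix (rows: cities, cols: months).
C = np.array([[30., 29, 25, 18, 14, 12],
              [27., 26, 22, 16, 12, 10],
              [22., 21, 18, 13, 10,  8],
              [18., 17, 14, 10,  7,  5]])

U, S, Vt = np.linalg.svd(C, full_matrices=False)

for d in (1, 2, 3):
    # Sum of d vector (outer) products = best rank-d approximation of C.
    C_d = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(d))
    err = np.linalg.norm(C - C_d)
    print(f"rank-{d} approximation error: {err:.3f}")
```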

For a knowledge graph, we can define a three-dimensional (order-3) one-hot tensor indexed by node × edge label × node, and decompose it into a sum of products of three vectors, as sketched in the figure (omitted). However, the goal of knowledge graph embedding is usually to assign a single vector to each entity.

DistMult is a pioneering method for computing knowledge graph embeddings based on rank decomposition: each entity and each relation is associated with a vector of dimension $d$, and for an edge s –p→ o the plausibility scoring function is $\sum_{i=1}^{d}(\mathbf{e}_s)_i(\mathbf{r}_p)_i(\mathbf{e}_o)_i$. The goal is to learn vectors for each node and edge label that maximize the plausibility of positive edges and minimize the plausibility of negative edges. This method has a shortcoming, however: the score of an edge from $s$ to $o$ is identical to the score of the corresponding edge from $o$ to $s$, so the direction of edges cannot be captured.
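A short numpy sketch of the DistMult scoring function above, with random toy vectors; it also demonstrates the symmetry the text points out: swapping the subject and object leaves the score unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
e_s, r_p, e_o = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

def distmult_score(e_s, r_p, e_o):
    # sum_i (e_s)_i * (r_p)_i * (e_o)_i
    return float(np.sum(e_s * r_p * e_o))

print(distmult_score(e_s, r_p, e_o))
print(distmult_score(e_o, r_p, e_s))   # identical: edge direction is not captured
```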

RESCAL uses matrices rather than vectors as relation embeddings, which preserves edge direction, but its space and time costs are much higher than those of DistMult. ComplEx uses complex-valued vectors and HolE uses circular correlation (over real-valued vectors) to preserve edge direction. Other decomposition methods include SimplE and TuckER; among these, TuckER currently reports state-of-the-art results on standard benchmarks.

Neural Models

Many methods use neural networks to learn knowledge graph embeddings with nonlinear scoring functions.

In addition to MLP-based approaches, there are convolutional methods such as ConvE and HypER, which learn convolutional kernels (weight parameters) that are applied over the embeddings to produce plausibility scores.

Language Models

Classic embedding methods were originally developed to represent natural language, e.g., word2vec and GloVe.

Similarly, language embedding methods can also be applied to graphs. RDF2Vec performs a biased random walk on the graph and records the traversed paths as "sentences", which are then fed into the word2vec model as input. KGloVe is based on the GloVe model. Just like the original GloVe model considers words that appear frequently in a text window to be more relevant, KGloVe uses personalized PageRank to determine which nodes are most relevant to a given node, and then feeds the results into the GloVe model.
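A rough sketch of the RDF2Vec-style idea (using uniform rather than biased walks, and without the word2vec training step): random walks over the graph are flattened into "sentences" of node and edge labels, which could then be fed to any word-embedding model. The toy graph is an assumption for illustration.

```python
import random

# node -> list of (edge_label, target_node)
graph = {
    "Santiago": [("flight", "Arica"), ("capital_of", "Chile")],
    "Arica": [("flight", "Santiago")],
    "Chile": [],
}

def random_walk(start, length, rng):
    """Walk `length` hops from `start`, recording node and edge labels as a 'sentence'."""
    sentence, node = [start], start
    for _ in range(length):
        if not graph[node]:
            break                        # dead end: stop the walk
        label, node = rng.choice(graph[node])
        sentence += [label, node]
    return sentence

rng = random.Random(0)
for _ in range(5):
    print(" ".join(random_walk("Santiago", 3, rng)))
# These token sequences would then be passed to a word2vec-style model
# (e.g. gensim's Word2Vec) to obtain node embeddings.
```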

Entailment-aware Models

So far, embeddings have only considered the data graph. But what if a priori knowledge in the form of rules is available? One option is to use such rules as constraints to refine the predictions made by embeddings: for example, if we define that an event can have at most one venue, it becomes less plausible to assign multiple venue edges to the same event.

Recent methods prefer to propose joint embeddings that consider both data graphs and rules, such as FSL and KALE .

Graph Neural Networks

Instead of representing graphs as mathematical vectors, an alternative is to define a custom machine learning architecture for the graph. Most such architectures are based on neural networks. However, traditional neural networks tend to have a more homogeneous topology, while the topology of graphs is usually more heterogeneous.

**Graph neural networks (GNNs)** are neural architectures that, unlike embeddings, support end-to-end supervised learning for a specific task: given a set of labeled examples, a GNN can be used to classify elements of a graph or the graph itself. GNNs have been used to classify graph-encoded compounds, objects in images, documents, etc., as well as to predict traffic, build recommender systems, verify software, and so on. Given labeled examples, GNNs can even replace graph algorithms; for instance, GNNs have been used to find central nodes in knowledge graphs in a supervised manner.

We will introduce two types of GNNs: recursive and convolutional.

Recursive Graph Neural Networks

Recursive graph neural networks (RecGNNs) were the pioneering work on graph neural networks. They take as input a directed graph whose nodes and edges are associated with static feature vectors capturing node and edge labels, weights, etc. Each node in the graph also has a state vector, which is recursively updated by a parametric transition function based on information from the node's neighbours (i.e., the feature and state vectors of adjacent nodes and edges). A parametric output function then computes a node's final output from its own feature and state vectors. These functions are applied recursively up to a fixed point. Given a set of partially labeled nodes in the graph, a neural network can be used to learn these two parametric functions; the result can be viewed as a recursive (or even recurrent) neural network architecture.

An example is shown in the figure (omitted). In the graph, each node $x$ is annotated with a feature vector $\mathbf{n}_x$ (a one-hot encoding of its node type) and a hidden state $\mathbf{h}_x^{(t)}$ at step $t$, while each edge $(x,y)$ is annotated with a feature vector $\mathbf{a}_{xy}$ (a one-hot encoding of its edge label). The functions $f$ and $g$ on the right are the parametric transition and output functions, respectively. To train the network, we can label which places have tourist offices and which do not; these labels can come from the knowledge graph or be added manually. The GNN then learns the parameters $\mathbf{w}$ and $\mathbf{w}'$, which can subsequently be used to label other nodes.
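A highly simplified numpy sketch of the recursive update described above, with invented weights and a toy graph: each node's state is recomputed from its neighbours' features and states until the states stop changing, and an output function then scores each node. Real RecGNNs learn $f$ and $g$; here they are fixed tanh/sigmoid functions just to show the control flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph: node -> neighbours; each node has a 2-dim feature vector n_x.
neigh = {0: [1, 2], 1: [0], 2: [0]}
n = {x: rng.normal(size=2) for x in neigh}

dim_h = 3
W = rng.normal(scale=0.3, size=(dim_h, 2 + 2 + dim_h))   # "f": transition parameters
w_out = rng.normal(size=dim_h + 2)                        # "g": output parameters

h = {x: np.zeros(dim_h) for x in neigh}                   # initial hidden states

# Recursively update states until (approximately) a fixed point.
for _ in range(100):
    new_h = {}
    for x in neigh:
        # f aggregates, for each neighbour y, the features of x and y and the state of y.
        msgs = [np.tanh(W @ np.concatenate([n[x], n[y], h[y]])) for y in neigh[x]]
        new_h[x] = np.mean(msgs, axis=0)
    if max(np.linalg.norm(new_h[x] - h[x]) for x in neigh) < 1e-6:
        break
    h = new_h

# g computes each node's output (e.g. a class probability) from its features and final state.
out = {x: float(1 / (1 + np.exp(-w_out @ np.concatenate([h[x], n[x]])))) for x in neigh}
print(out)
```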

Convolutional Graph Neural Networks

Both GNNs and CNNs work on local regions of the input data: GNNs operate on nodes in the graph and their neighbors. Following this intuition, many convolutional graph neural networks ( ConvGNNs ), also known as graph convolutional networks ( GCNs ), have been proposed, in which the transformation function is implemented through convolution.

One benefit of CNNs is that the same kernel can be applied to every region of the image, but this does not directly carry over to ConvGNNs: unlike image pixels, which have a predictable number of neighbours, nodes in a graph can have widely varying neighbourhoods. One approach to this challenge uses spectral representations of graphs to induce a more regular structure; another uses an attention mechanism to learn which neighbours' features are most important to the current node.

Apart from architectural aspects, there are two main differences between RecGNN and ConvGNN.

  1. RecGNN recursively aggregates information from neighbors to a fixed point, while ConvGNN typically applies a fixed number of convolutional layers.
  2. RecGNNs use the same functions/parameters uniformly across all steps, whereas the different convolutional layers of a ConvGNN can apply different kernels/weights at each step.

Symbolic Learning

The supervised techniques discussed so far produce numerical models that are difficult to interpret: the reasons behind a given prediction are implicit in a complex matrix of learned parameters. Embeddings also suffer from an out-of-vocabulary problem, often failing to provide results for inputs involving previously unseen nodes or edges. One alternative is symbolic learning, which learns hypotheses in a logical (symbolic) language that "explain" sets of positive and negative edges. Such hypotheses are interpretable; furthermore, they are quantified (e.g., "all airports are domestic or international"), which partially addresses the out-of-vocabulary problem.

There are two main forms of symbolic learning for knowledge graphs: rule mining, which learns rules, and axiom mining, which learns other forms of logical axioms. This part mainly relates to the rules and ontology languages of Section 3 and sees relatively little practical application.

Rule Mining

Rule mining generally refers to discovering meaningful patterns in the form of rules from a large collection of background knowledge.

While similar tasks have been explored for relational settings using Inductive Logic Programming (ILP), it is not obvious how to define negative edges when working with incomplete knowledge graphs (under the OWA). A common heuristic is to adopt the Partial Completeness Assumption (PCA).
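A toy illustration of the PCA heuristic, under a simplified reading of it: a predicted head edge (x, r, y) counts as a negative only if x already has some r edge in the graph; predictions for subjects with no r edge at all are treated as unknown rather than false. The rule and data below are invented for the example.

```python
# Toy knowledge graph as (subject, relation, object) triples.
kg = {
    ("Arica", "flight", "Santiago"),
    ("Santiago", "flight", "Arica"),
    ("Santiago", "flight", "Rapa Nui"),      # Rapa Nui has no recorded outgoing flights
    ("Santiago", "capital_of", "Chile"),
}

# Candidate rule: flight(x, y) => flight(y, x)
predictions = {(y, "flight", x) for (x, r, y) in kg if r == "flight"}

support = len(predictions & kg)               # predicted edges that are in the graph

def has_some_edge(subject, relation):
    return any(s == subject and r == relation for (s, r, o) in kg)

# PCA denominator: only predictions whose subject has at least one `flight` edge in the graph.
pca_body = [p for p in predictions if has_some_edge(p[0], p[1])]

std_confidence = support / len(predictions)   # standard (closed-world style) confidence
pca_confidence = support / len(pca_body)      # PCA confidence
print(support, round(std_confidence, 3), round(pca_confidence, 3))   # 2 0.667 1.0
```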

An influential graph rule mining system is AMIE , which adopts the PCA confidence measure and builds rules in a top-down manner. Later work built on these techniques to mine rules from knowledge graphs.

Another research direction is differentiable rule mining, which enables end-to-end learning of rules by using matrix multiplication to encode the joins in rule bodies.

Axiom Mining

In addition to rules, more general forms of axiom, expressed in logical languages such as DLs, can be mined from knowledge graphs. These methods can be divided into two categories: those that mine specific kinds of axiom (such as disjointness axioms) and those that mine general axioms (e.g., DL-Learner).

5. Summary and Conclusion

A knowledge graph is a common foundation of knowledge within an organization or community that enables the representation, accumulation, management, and dissemination of knowledge over time. Knowledge graphs have been used in a variety of use cases, ranging from commercial applications involving semantic search, user recommendations, conversational agents, targeted advertising, transportation automation, etc., to open knowledge graphs for the public good. General trends include:

  • using knowledge graphs to integrate and leverage data from diverse sources at large scale, and
  • combining deductive techniques (rules, ontologies, etc.) and inductive techniques (machine learning, analytics, etc.) to represent and accumulate knowledge.

Beyond specific topics, more general challenges for knowledge graphs include scalability, especially for deductive and inductive reasoning; quality, not only of the data but also of the models derived from knowledge graphs; diversity, such as managing contextual or multimodal data; dynamicity, considering temporal or streaming data; and finally usability, which is key to increasing adoption. Although techniques continue to be proposed to address precisely these challenges, they are unlikely ever to be completely solved; instead, they serve as dimensions along which knowledge graphs and their techniques and tools will continue to mature.

Origin blog.csdn.net/qq_43734019/article/details/127251416