Construction of a Knowledge Graph

The core of building a knowledge graph lies in understanding the business and in the design of the graph itself. This is similar to designing database tables for a business system: the design is grounded in the business, and estimates of how the business may change in future scenarios come from continuous exploration.

The construction of a complete knowledge graph includes the following steps:

1. Define the business problem
2. Data collection & preprocessing
3. Knowledge graph design
4. Knowledge graph data storage
5. Development of upper-layer applications and system evaluation

1. Define specific business problems

In the P2P online lending environment, the core issue is risk control, that is, how to evaluate a borrower's risk. In an online setting, fraud risk is particularly serious, and much of it hides inside complex relationship networks. The knowledge graph is designed for exactly this type of problem, so we "may" reasonably expect it to bring some value to fraud detection.

Before moving on to the next topic, one thing must be made clear: does your business problem actually need the support of a knowledge graph system? In many practical scenarios, even when there is some need for relationship analysis, a traditional database is enough to complete the analysis. Therefore, to avoid adopting a knowledge graph where it is not needed, and to make a better technology selection in general, a few points are summarized below for reference.

Bainiu Data's products, built on in-depth research into the business and the support of big data, use knowledge graphs to present problems that span multiple scenarios and needs, making it easier for customers to locate corporate problems quickly and accurately.

2. Data collection & preprocessing

The next step is to determine the data sources and do the necessary data preprocessing. Regarding data sources, we need to consider the following points: 1. What data do we already have? 2. What data, though not available now, could possibly be obtained? 3. Which part of this data can be used to reduce risk? 4. Which part of the data can be used to build the knowledge graph? One thing needs to be made clear here: not all data related to anti-fraud has to enter the knowledge graph.

For anti-fraud, several data sources readily come to mind, including basic user information, behavioral data, operator data, public information on the Internet, and so on. Assuming that we already have a list of data sources, the next step is to see which data needs further processing. For unstructured data, we will more or less need techniques from natural language processing. The basic information filled in by users is mostly stored in business tables; apart from individual fields that need further processing, many fields can be used directly for modeling or added to the knowledge graph system. Behavioral data needs some simple processing to extract the useful signal, such as "how long the user stayed on a certain page". For web pages published on the Internet, information extraction techniques are needed.
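
As a concrete illustration of the behavioral-data step, here is a minimal sketch that turns a raw click stream into a "time spent on page" feature. The event fields and log format are hypothetical:

```python
# A minimal sketch of deriving "time spent on page" from a raw
# click stream; the (user_id, page, timestamp) format is hypothetical.
from collections import defaultdict

events = [  # hypothetical behavioral log: (user_id, page, timestamp in seconds)
    ("u1", "loan_form", 100), ("u1", "loan_form_submit", 160),
    ("u1", "profile", 200), ("u1", "home", 215),
]

def time_per_page(events):
    """Approximate dwell time as the gap between an event and the
    same user's next event."""
    by_user = defaultdict(list)
    for user, page, ts in events:
        by_user[user].append((ts, page))
    dwell = defaultdict(float)
    for user, seq in by_user.items():
        seq.sort()
        for (ts, page), (next_ts, _) in zip(seq, seq[1:]):
            dwell[(user, page)] += next_ts - ts
    return dict(dwell)

print(time_per_page(events))
# {('u1', 'loan_form'): 60.0, ('u1', 'loan_form_submit'): 40.0, ('u1', 'profile'): 15.0}
```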

For example, for the user's basic information we will likely need the following operations. On the one hand, fields such as name, age and education can be extracted directly from the structured database and used as-is. On the other hand, the company name filled in by the user may need further processing: one user fills in "Beijing Jingdong Co., Ltd." while another fills in "Beijing Jingdong Century Trading Co., Ltd.", yet both actually refer to the same company. At this point we need to align the company names. For the technical details, refer to the entity alignment techniques mentioned earlier.
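
As a rough illustration of that alignment step, the sketch below matches company names by normalized string similarity. The suffix list and threshold are invented for this example; a real system would combine such matching with registry lookups and richer features:

```python
# A minimal entity-alignment sketch: normalize away common corporate
# suffixes, then compare the remaining strings for similarity.
from difflib import SequenceMatcher

NOISE_TOKENS = ["Co., Ltd.", "Trading", "Century"]  # chosen for this example only

def normalize(name: str) -> str:
    for token in NOISE_TOKENS:
        name = name.replace(token, "")
    return "".join(name.split()).lower()

def same_company(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two filled-in names as the same entity when their
    normalized forms are similar enough."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_company("Beijing Jingdong Co., Ltd.",
                   "Beijing Jingdong Century Trading Co., Ltd."))  # True
```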

3. Design of the knowledge graph

Graph design is an art. It requires not only a deep understanding of the business but also a reasonable estimate of how the business may change in the future, so as to design a system that fits the current situation and performs efficiently. When designing a knowledge graph, we inevitably face the following common questions: 1. What entities, relationships and attributes are needed? 2. Which attributes can be modeled as entities, and which entities as attributes? 3. What information does not need to be placed in the knowledge graph?

Based on these common questions, we have abstracted a series of design principles from past design experience. Much like the normal forms of traditional database design, these principles guide practitioners toward a more reasonable knowledge graph while keeping the system efficient.

Business Principle: everything must start from the business logic. By observing the design of the knowledge graph, it should be easy to infer the business logic behind it, and possible future changes in the business must also be considered during the design.

For example, look at the diagram below and ask yourself what the business logic behind it is. Even after some observation, it is actually difficult to see what the business process looks like. A brief explanation: the entity "application" here stands for a loan application, and for anyone familiar with this field it is the entry entity. In the diagram, what do "has_phone" and "parent_phone" between the application and phone entities actually mean?

Next, look at the diagram below. The difference from the previous one is that we extracted the applicant from the original attributes and made it a separate entity. Now the entire business logic becomes very clear: we can easily see that Zhang San applied for two loans, that he has two mobile phone numbers, and that when applying for one of the loans he filled in his parents' phone number. In short, a good design lets people read the business logic straight off the graph.
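
To make the contrast concrete, here is a minimal sketch of the redesigned structure with the applicant as its own entity. It uses networkx, and the node names and relation labels are illustrative:

```python
# A sketch of the redesigned graph: "applicant" promoted from an
# attribute of the application to a first-class entity.
import networkx as nx

g = nx.MultiDiGraph()
for node, kind in [("Zhang San", "applicant"), ("app_1", "application"),
                   ("app_2", "application"), ("phone_A", "phone"),
                   ("phone_B", "phone"), ("parent_phone", "phone")]:
    g.add_node(node, type=kind)

g.add_edge("Zhang San", "app_1", rel="applies_for")
g.add_edge("Zhang San", "app_2", rel="applies_for")
g.add_edge("Zhang San", "phone_A", rel="has_phone")
g.add_edge("Zhang San", "phone_B", rel="has_phone")
g.add_edge("app_1", "parent_phone", rel="parent_phone")

# The business logic now reads directly off the edges:
for u, v, data in g.edges(data=True):
    print(u, f"--{data['rel']}-->", v)
```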

The Efficiency Principle says to make the knowledge graph as lightweight as possible and to decide which data belongs in the knowledge graph and which does not. In classic computer storage systems, we often talk about memory and disk: memory is the efficient access medium and is key to running all programs. This storage hierarchy stems from the locality of data, meaning that frequently accessed data tends to cluster together, so that portion can be kept in memory to improve access efficiency. Similar logic applies to the design of knowledge graphs: we store commonly used information in the knowledge graph, and put information that is accessed infrequently and unimportant for relationship analysis in a traditional relational database. The core of the efficiency principle is to design the knowledge graph as a small, light storage carrier.

For example, in the knowledge graph below, we can move fields such as "age" and "hometown" into a traditional relational database, because these fields (a) contribute little to relationship analysis and (b) are accessed infrequently, so keeping them in the knowledge graph would hurt efficiency.
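
A minimal sketch of this split, with SQLite standing in for the relational store; the entity id and field names are illustrative:

```python
# The efficiency principle in miniature: the graph keeps identity and
# relationships, while low-frequency attributes live in a relational
# table keyed by the same entity id.
import sqlite3
import networkx as nx

g = nx.Graph()
g.add_node("user_42", type="applicant")  # graph side: relations only

db = sqlite3.connect(":memory:")         # relational side: detail attributes
db.execute("CREATE TABLE user_attrs (id TEXT PRIMARY KEY, age INT, hometown TEXT)")
db.execute("INSERT INTO user_attrs VALUES ('user_42', 31, 'Hangzhou')")

# Relationship analysis walks the graph; detail lookups join back by id.
age, hometown = db.execute(
    "SELECT age, hometown FROM user_attrs WHERE id = 'user_42'").fetchone()
print(age, hometown)  # 31 Hangzhou
```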

The Analytics Principle: entities that are irrelevant to relationship analysis do not need to be placed in the graph.

The Redundancy Principle: repetitive and high-frequency information can be placed in a traditional database instead.

4. Knowledge graph data storage

In terms of storage, we must choose a storage system, and since the knowledge graph we designed carries attributes, a graph database is the natural first choice. Which graph database to pick depends on business volume and efficiency requirements. If the amount of data is particularly large, Neo4j is likely to fall short of the business needs; in that case you have to choose a system with quasi-distributed support such as OrientDB or JanusGraph, or apply the efficiency and redundancy principles to move information into a traditional database, thereby reducing the amount of information the knowledge graph has to carry. Generally speaking, Neo4j is sufficient for graphs with fewer than one billion nodes.
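
A minimal write-path sketch using the official Neo4j Python driver (v5 API); the URI, credentials and labels are placeholders:

```python
# A sketch of loading entities and relationships into Neo4j;
# connection details are placeholders, not a real deployment.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_application(tx, applicant, app_id, phone):
    # MERGE keeps the load idempotent: nodes and edges are created once.
    tx.run(
        "MERGE (a:Applicant {name: $applicant}) "
        "MERGE (l:Application {id: $app_id}) "
        "MERGE (p:Phone {number: $phone}) "
        "MERGE (a)-[:APPLIES_FOR]->(l) "
        "MERGE (a)-[:HAS_PHONE]->(p)",
        applicant=applicant, app_id=app_id, phone=phone,
    )

with driver.session() as session:
    session.execute_write(add_application, "Zhang San", "app_1", "138-0000-0000")
driver.close()
```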

5. Development of upper-layer applications

After building the knowledge graph, it is necessary to use it to solve specific problems. For risk control knowledge graphs, the first task is to discover fraud risks hidden in the relationship network. From an algorithmic perspective, there are two different scenarios: one is rule-based; the other is probability-based. In view of the current status of AI technology, rule-based methodologies still dominate applications in vertical fields. However, as the amount of data increases and methodologies improve, probability-based models will gradually bring greater value.

1. Rule-based applications

Next, several rule-based applications are introduced, including inconsistency verification, rule-based feature extraction, and pattern-based judgment.

Inconsistency verification: To judge the risks in a relationship network, a simple method is inconsistency verification, that is, finding potential contradictions through rules. These rules are defined manually in advance, so designing them requires some business knowledge. For example, both Li Ming and Li Fei filled in the same company phone number, but the database shows that they actually work for different companies: a contradiction. There can be many similar rules; they are not all listed here.
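
A minimal sketch of this particular rule over invented records; a production system would typically express such checks as graph queries:

```python
# Inconsistency rule: applicants who share a company phone number
# should work for the same company. The records are invented.
from collections import defaultdict

applicants = [
    {"name": "Li Ming", "company": "Alpha Trading", "company_phone": "010-1234"},
    {"name": "Li Fei",  "company": "Beta Finance",  "company_phone": "010-1234"},
]

by_phone = defaultdict(list)
for a in applicants:
    by_phone[a["company_phone"]].append(a)

for phone, group in by_phone.items():
    companies = {a["company"] for a in group}
    if len(group) > 1 and len(companies) > 1:
        names = ", ".join(a["name"] for a in group)
        print(f"Contradiction: {names} share phone {phone} "
              f"but list different companies: {companies}")
```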

Feature extraction based on rules: We can also use rules to extract features from the knowledge graph, generally based on deep searches over 2-degree, 3-degree or even deeper relationships. For example, we can ask: "How many entities within the applicant's 2-degree relationships have touched the blacklist?" Once extracted, these features can generally serve as input to a risk model. It is worth repeating that if a feature does not involve deep relationships, a traditional relational database is in fact sufficient.
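
A sketch of that exact feature on a toy graph; the graph, the blacklist and the radius are invented for illustration:

```python
# Rule-based feature: count blacklisted entities reachable within
# `radius` hops of the applicant.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("applicant", "phone_A"), ("phone_A", "user_x"),
    ("applicant", "device_1"), ("device_1", "user_y"),
])
blacklist = {"user_x", "user_y"}  # hypothetical blacklist

def blacklist_hits_within(g, node, radius, blacklist):
    neighbourhood = nx.ego_graph(g, node, radius=radius)
    return sum(1 for n in neighbourhood if n != node and n in blacklist)

print(blacklist_hits_within(g, "applicant", 2, blacklist))  # 2
```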

Pattern-based judgment: This method is better suited to finding group fraud. Its core is to use certain patterns to find groups or subgraphs that may carry risk, and then analyze those subgraphs further.
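
As a toy illustration, the sketch below flags any contact node shared by suspiciously many applications, a common group-fraud pattern; the graph and threshold are invented:

```python
# Pattern: many applications hanging off one shared phone number.
import networkx as nx

g = nx.Graph()
for i in range(5):                 # five applications share one phone
    g.add_edge(f"app_{i}", "phone_shared")
g.add_edge("app_legit", "phone_normal")

SHARED_DEGREE = 3  # how much sharing counts as suspicious (illustrative)

for node in g:
    if node.startswith("phone") and g.degree(node) >= SHARED_DEGREE:
        print(f"Suspicious cluster around {node}: {sorted(g.neighbors(node))}")
```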

2. Probability-based methods

In addition to rule-based methods, probabilistic and statistical methods can also be used. Techniques such as community mining, label propagation and clustering all fall into this category.

The purpose of community mining algorithms is to find communities in the graph. A community can be defined in many ways, but intuitively, the density of relationships among nodes within a community is significantly greater than the density of relationships between communities. Since community mining follows a probabilistic methodology, its advantage is that no rules need to be defined by hand, which matters especially for huge relationship networks, where defining rules is itself a very complicated task.
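
A minimal sketch using networkx's greedy modularity algorithm as a stand-in for whatever community-detection method a production system would choose; the toy graph has two dense triangles joined by one weak bridge:

```python
# Community mining: modularity-based clustering on a toy graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # dense triangle
                  ("x", "y"), ("y", "z"), ("x", "z"),   # another triangle
                  ("c", "x")])                          # weak bridge

for i, community in enumerate(greedy_modularity_communities(g)):
    print(f"community {i}: {sorted(community)}")
```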

The core idea of the label propagation algorithm is the transfer of information between nodes. It is similar to the observation that when you spend time with good people, you gradually become better: through those relationships you keep absorbing high-quality information, and eventually, without noticing it, you improve.
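
The sketch below propagates a fraud-risk score outward from a known bad actor; the damping factor, iteration count and toy graph are all illustrative, not a textbook algorithm:

```python
# Score propagation: each node absorbs part of its neighbours'
# average risk score on every iteration.
import networkx as nx

g = nx.Graph([("fraudster", "phone"), ("phone", "applicant"),
              ("applicant", "friend")])
score = {n: 0.0 for n in g}
score["fraudster"] = 1.0  # known bad actor seeds the propagation

ALPHA, ITERATIONS = 0.5, 10  # damping and rounds, chosen arbitrarily
for _ in range(ITERATIONS):
    new_score = {}
    for n in g:
        neighbour_avg = sum(score[m] for m in g[n]) / g.degree(n)
        new_score[n] = max(score[n], ALPHA * neighbour_avg)  # seeds never decay
    score = new_score

print({n: round(s, 3) for n, s in score.items()})
# risk decays with distance from the fraudster
```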

Compared with rule-based methodologies, the disadvantage of probability-based methods is that they require enough data. If the amount of data is small, the whole graph will be sparse, in which case rule-based methods become the first choice.

3. Analysis based on dynamic networks

All of the above analyses are based on a static relationship graph. "Static" means that we do not consider changes in the graph structure over time and focus only on the current structure. But we also know that graph structures change over time, and those changes themselves can be associated with risk.
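
As a tiny example of a dynamic signal, the sketch below compares an entity's degree across two weekly snapshots; the snapshots and threshold are invented:

```python
# Dynamic signal: a sudden burst of new connections on one phone node
# between two snapshots of the graph.
import networkx as nx

g_last_week = nx.Graph([("phone", "app_1")])
g_this_week = nx.Graph([("phone", "app_1"), ("phone", "app_2"),
                        ("phone", "app_3"), ("phone", "app_4")])

growth = g_this_week.degree("phone") - g_last_week.degree("phone")
if growth >= 3:  # illustrative threshold
    print(f"phone gained {growth} connections in one week: review manually")
```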
