[Practical Explanation] Data Lineage Implementation

In a complex system of social division of labor, each person needs a clear sense of their own position in order to contribute the most value. The same is true for data, and this is why data lineage came into being.

This article explains data lineage from end to end and presents a concrete implementation plan.

1. What is data lineage

Data lineage refers to the relationships between pieces of data that arise as data is processed and circulated.

It provides a means of exploring data relationships and tracing the paths along which data flows.

2. The composition of data lineage

1. Data node

Nodes in a data lineage graph can be understood as the entities through which data flows, each carrying some data function or service. Databases, tables, and fields are all data nodes; in a broader sense, any entity related to data services can be included in the lineage graph as a node, such as metrics, reports, and business systems.

By position in the lineage, nodes fall into three main categories: source node -> intermediate node -> terminal node.

Source node: a data provider; the starting point of the lineage.

Intermediate node: the most common node type in a lineage; it both receives data from upstream and sends data downstream.

Terminal node: the end point of the lineage, generally at the application layer, for example visual reports, dashboards, or business systems.

2. Node attributes

The attribute information attached to a node, such as table name, field names, comments, and descriptions.
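To make this concrete, here is a minimal sketch of a node model, assuming Python dataclasses; the node kinds follow the three categories above, and the identifiers and attribute fields are hypothetical examples:

```python
# A minimal sketch of a lineage node model; the enum values mirror the three
# node categories above, and the attribute fields are hypothetical examples.
from dataclasses import dataclass, field
from enum import Enum

class NodeKind(Enum):
    SOURCE = "source"              # data provider, starting point of the lineage
    INTERMEDIATE = "intermediate"  # both receives and emits data
    TERMINAL = "terminal"          # application layer: reports, dashboards, etc.

@dataclass
class LineageNode:
    node_id: str                   # e.g. "warehouse.dwd_orders"
    kind: NodeKind
    attrs: dict[str, str] = field(default_factory=dict)  # table name, comment, ...

node = LineageNode("warehouse.dwd_orders", NodeKind.INTERMEDIATE,
                   {"table": "dwd_orders", "comment": "cleaned orders"})
```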

3. Flow path

A flow path conveys the inflow and outflow of data along three dimensions: flow direction, update magnitude, and update frequency:

Flow direction: indicated with arrows.

Update magnitude: the larger the volume of updated data, the thicker the edge is drawn, signaling more important data.

Update frequency: the more frequently the data is updated, the shorter the edge is drawn; frequent change likewise signals higher importance.

4. Flow rules and attributes

Flow rules capture how data changes as it moves, and attributes record what the current path does to the data. Users can inspect a path's rules and attributes from the flow path view. A rule can be a direct mapping or something more complex. For example:

Data mapping: the data is extracted directly, without any changes.

Data cleansing: records the filtering criteria applied as data flows, for example requiring that values be non-null or conform to a specific format.

Data conversion: the data flowing out of an entity needs special processing before it can be delivered to the data consumer.

Data scheduling: reflects the scheduling dependencies of the current data.

Data application: provides data to reports and applications.
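Together with the flow-path dimensions above, these rules can be modeled as edge attributes. A minimal sketch, with hypothetical field names:

```python
# A minimal sketch of a lineage edge carrying the flow-path dimensions and rule
# types described above; field names and the rule list are hypothetical.
from dataclasses import dataclass
from enum import Enum

class FlowRule(Enum):
    MAPPING = "data_mapping"          # extracted directly, unchanged
    CLEANSING = "data_cleansing"      # filtering criteria applied
    CONVERSION = "data_conversion"    # special processing before delivery
    SCHEDULING = "data_scheduling"    # scheduling dependency
    APPLICATION = "data_application"  # feeds a report or application

@dataclass
class LineageEdge:
    source: str        # upstream node id (arrow tail)
    target: str        # downstream node id (arrow head)
    rule: FlowRule
    update_rows: int   # update magnitude: rendered as edge thickness
    update_freq: str   # update frequency, e.g. "daily" or "hourly"

edge = LineageEdge("ods_orders", "dwd_orders", FlowRule.CLEANSING,
                   update_rows=1_000_000, update_freq="daily")
```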

3. Why do we need data lineage

1. Ever-growing data development tangles the relationships between tables, and management and usage costs soar

This is the most fundamental need that data lineage addresses. Big data teams both collect data and serve it out, and the sheer volume of data plus chaotic dependencies between tables drive management and usage costs through the roof.

2. Data value is hard to evaluate, and data quality is hard to drive forward

How do we set a clear, scientific standard for prioritizing tables, allocating compute resources, and choosing which tables get data quality monitoring?

3. No basis for deciding which tables can be deleted and which cannot

Across business databases, data warehouses, intermediate stores, development and test databases, redundant data is bound to exist. How do we reclaim that storage?

4. One table changes, and a pile of tables break

You change a table's fields, and the next morning your inbox is full of failed-task alerts.

5. Root cause analysis, impact analysis, and recovery when ETL tasks fail

Following on from the previous point: when a task fails or an ETL job breaks, how do we locate the root cause, assess the impact, and quickly recover the affected downstream nodes?

6. Scheduling dependencies are chaotic

Chaotic data dependencies inevitably lead to chaotic scheduling. How do we build robust scheduling dependencies?

7. Data security audits are difficult to carry out

Industries with strict security requirements, such as banking, insurance, and government, must pay close attention to data security, data leakage, and compliance.

Because ETL jobs operate on data along a chain, the data in a downstream table comes from upstream tables, so security audits must cover the full data link; otherwise a downstream table with a lower security level could leak core upstream data.

4. What Data Lineage Can Do

1. Pipeline positioning and traceability

Visually display a target table's upstream and downstream dependencies at a glance.

2. Determining the scope of impact

The impact scope can be determined from the number and types of a node's downstream nodes, which prevents a change to an upstream table from silently breaking downstream tables.

3. Evaluating data value and driving data quality

By counting and ranking the downstream nodes of every table node as a basis for evaluation, you can focus on the nodes with the most downstream consumers and add data quality monitoring to them.

4. Providing a basis for data removal

For example, a data node with no downstream consumers and no archiving requirement is a candidate for removal. (A combined sketch of points 2 through 4 follows below.)
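Points 2 through 4 reduce to simple graph queries once the lineage is a directed graph. A minimal sketch, assuming networkx and a hypothetical table-level graph:

```python
# A minimal sketch of impact analysis, value ranking, and orphan detection on a
# table-level lineage graph, using networkx; tables and edges are hypothetical.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("ods_orders", "dwd_orders"),        # edges point upstream -> downstream
    ("dwd_orders", "ads_order_report"),
    ("ods_users", "dwd_users"),
])
g.add_node("tmp_scratch")                # an isolated node

# (2) Impact scope: every node downstream of a table we plan to change.
print(nx.descendants(g, "ods_orders"))   # {'dwd_orders', 'ads_order_report'}

# (3) Data value: rank tables by how many downstream consumers they feed.
rank = sorted(g.nodes, key=lambda n: len(nx.descendants(g, n)), reverse=True)
print(rank[0])                           # 'ods_orders'

# (4) Removal candidates: nodes with no upstream and no downstream at all.
print([n for n in g.nodes if g.degree(n) == 0])  # ['tmp_scratch']
```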

5. Root cause analysis and quick recovery

When a task fails, inspect its upstream lineage nodes to locate the root cause, and use the downstream nodes of the failed task's node to quickly restore the affected tasks.

6. Sorting out scheduling dependencies

Lineage nodes can be bound to scheduler nodes, so that ETL scheduling follows the lineage dependencies, as sketched below.
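As a sketch of lineage-driven scheduling, assuming Apache Airflow 2.4+ and a hypothetical lineage edge list; each table gets a refresh task, and the task dependencies are wired straight from the lineage edges:

```python
# A minimal sketch, assuming Apache Airflow 2.4+; table names and the refresh
# command are hypothetical placeholders for real ETL jobs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

LINEAGE_EDGES = [("ods_orders", "dwd_orders"), ("dwd_orders", "ads_order_report")]

with DAG(dag_id="lineage_driven_etl", start_date=datetime(2023, 1, 1),
         schedule=None, catchup=False) as dag:
    tasks = {}
    for up, down in LINEAGE_EDGES:
        for table in (up, down):
            if table not in tasks:
                # One refresh task per table; replace with the real ETL command.
                tasks[table] = BashOperator(task_id=f"refresh_{table}",
                                            bash_command=f"echo refresh {table}")
    # Wire scheduling dependencies directly from lineage edges.
    for up, down in LINEAGE_EDGES:
        tasks[up] >> tasks[down]
```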

7. Data security audit

Data carries permissions and security levels, and downstream data should never sit at a lower security level than its upstream source; otherwise there is a risk of permission leakage.

Based on lineage, we can scan downstream of high-security nodes and check that their permissions are consistent with the upstream nodes, eliminating compliance risks such as permission leakage and data leakage.

5. Data lineage implementation options

At present, the industry has three main ways to put a data lineage system into production:

1. Adopt an open source system

Apache Atlas, Metacat, DataHub

The biggest advantage of an open source system is its low upfront cost, but the main disadvantages are:

1. Poor fit: an open source solution cannot fully match the company's existing pain points.

2. High cost of secondary development: the open source version has to be customized and extended to fit.

2. Buy a commercial vendor platform

Yixin Huachen, NetEase Shufan, etc.

These data platforms ship with a built-in data lineage management system that is full-featured and easy to use, but they have the following disadvantages:

1. Expensive.

2. You must go all-in on the platform: to make the lineage features work, every data service has to be migrated onto the vendor's platform.

3. Self-built

This means building your own lineage management system from a graph database, a backend, and a frontend. The development investment is relatively large, but it has the following advantages:

1. Tailored fit: the metadata and lineage system can be custom-built around your core pain points.

2. Technology accumulation: for developers, building a lineage system from 0 to 1 brings a much deeper understanding of data services.

3. Decoupling: the system is independent of the data platform, so lineage development does not affect normal business.

Next, let's talk about how to build a data lineage system ourselves.

6. How to build a data lineage system

1. Clarify needs and determine boundaries

Before building the lineage system, do requirements research to clarify its main functions, and from that determine the finest node granularity and the boundary of entities the system will cover.

For example: does node granularity need to reach the field level, or is table level enough? Generally speaking, table-level lineage solves about 75% of the pain points, while field-level lineage is far more complex to build. A small team can reasonably settle for table-level lineage only.

Common entity nodes include task nodes, database nodes, table nodes, field nodes, metric nodes, report nodes, and department nodes. The lineage system can be extended with any data-related entity node, allowing data flows to be viewed from different angles, such as the lineage between tables and metrics, or between metrics and reports. However, the scope of entity nodes must be clearly bounded and cannot expand without limit.

Once the requirements are clear and the node granularity and scope are fixed, you can design a solution targeted at the actual pain points, keeping the lineage system from growing bloated and improving its ROI.

2. Build a metadata management system

Every lineage system currently on the market depends on a metadata management system.

Metadata is the foundation of lineage: first, it is used to build the relationships between nodes; second, to populate node attributes; and third, lineage applications must build on metadata to deliver their full value. A reasonably complete metadata layer is therefore a prerequisite for building a lineage system.

3. Technology selection: graph database

The industry currently favors graph databases for storing lineage.

For deeply nested lineage queries spanning many hops, a relational database must perform join after join; the number of joins grows with the query depth, which badly hurts response time.

In a graph database, the application needs no foreign-key constraints to relate records: relationships themselves are the hops used for traversal. Querying along relationships is fast, and a graph is also the more natural way to express lineage.

[Figure: logical comparison of a graph database versus a relational database when querying contacts]
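To make the contrast concrete, here is a minimal sketch of a multi-hop lineage lookup, assuming Neo4j as the graph database with a hypothetical (:Table)-[:FEEDS]->(:Table) schema; one Cypher pattern covers any traversal depth, where SQL would need a join (or a recursive CTE) per hop:

```python
# A minimal sketch, assuming Neo4j; the (:Table)-[:FEEDS]->(:Table) schema,
# connection settings, and table name are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# One Cypher statement traverses arbitrarily many hops upstream;
# a relational schema would need one self-join (or a recursive CTE) per hop.
QUERY = """
MATCH (t:Table {name: $name})<-[:FEEDS*1..]-(up:Table)
RETURN DISTINCT up.name AS upstream
"""

with driver.session() as session:
    for record in session.run(QUERY, name="ads_order_report"):
        print(record["upstream"])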

4. Lineage entry: automatic parsing and manual registration

Automatic parsing:

Once the metadata is in place, the ETL SQL statement stored with each table can be fed through a SQL parser (JSqlParser is recommended) to automatically extract the table's source tables, which are then recorded as lineage; a parsing sketch follows below.

Manual registration:

If a table has no ETL SQL statement (its data arrives via manual import, hand-written code, Spark RDD jobs, and so on), the source tables cannot be determined automatically, so we register them by hand and then record the lineage.
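As a sketch of the automatic-parsing step, here is source-table extraction in Python using sqlglot, a stand-in for the Java JSqlParser recommended above; the SQL is a toy example:

```python
# A minimal sketch of source-table extraction, using Python's sqlglot as a
# stand-in for the Java JSqlParser the article recommends; the SQL is a toy case.
from sqlglot import exp, parse_one

sql = """
INSERT INTO dwd_orders
SELECT o.order_id, u.user_name
FROM ods_orders AS o
JOIN ods_users AS u ON o.user_id = u.user_id
"""

tree = parse_one(sql)

# The INSERT target is the downstream table; every other table referenced
# by the statement is an upstream source.
target = tree.this.name
sources = {t.name for t in tree.find_all(exp.Table)} - {target}

print(target, "<-", sources)  # dwd_orders <- {'ods_orders', 'ods_users'}
```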

5. Lineage visualization

Once the lineage system is built, visualization is needed to better demonstrate the value of lineage and quantify its output. This splits into two steps:

(1) Link and attribute display:

Starting from a chosen node, click operations progressively reveal the direction of the links between lineage nodes, along with each node's attribute information.

(2) Node operations:

Building on the visualized lineage nodes and the metadata attributes attached to each node, we can imagine automated operations such as:

Node scheduling: directly launch the current table node's scheduling task from the lineage view.

Attribute modification: edit and save the current node's metadata attributes from the frontend.
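For the frontend, the lineage subgraph can be serialized into the nodes-and-edges JSON shape that graph-visualization libraries such as AntV G6 consume. A minimal sketch, with hypothetical content:

```python
# A minimal sketch of serializing a lineage subgraph into a {nodes, edges} JSON
# payload for a front-end graph library; the lineage content is hypothetical.
import json

lineage = {
    "nodes": [
        {"id": "ods_orders", "type": "table", "attrs": {"comment": "raw orders"}},
        {"id": "dwd_orders", "type": "table", "attrs": {"comment": "cleaned orders"}},
    ],
    "edges": [
        {"source": "ods_orders", "target": "dwd_orders", "rule": "data_cleansing"},
    ],
}

print(json.dumps(lineage, indent=2))  # payload returned to the visualization frontend
```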

6. Statistical analysis of lineage

Once the lineage is built, we can run statistical analyses to view the distribution and usage of data at different levels, so as to support the business better, faster, and more clearly.

Taking our team as an example, we rely on lineage statistics like the following to support data services (a query sketch follows after the list):

Ranking data nodes by downstream node count, to evaluate data value and impact scope

Querying all upstream nodes of a given node, for business traceability

Detailed statistics on the reports each data node feeds, used for publishing and updating reports

Querying isolated nodes, i.e. nodes with no upstream or downstream, as a basis for data deletion
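A minimal sketch of the first and last of these statistics as Cypher, under the same hypothetical Neo4j schema as before; each query string can be executed with session.run():

```python
# A minimal sketch of lineage statistics as Cypher, assuming the hypothetical
# (:Table)-[:FEEDS]->(:Table) schema used earlier.

# Rank table nodes by downstream count (data value / impact scope).
RANK_BY_DOWNSTREAM = """
MATCH (t:Table)
OPTIONAL MATCH (t)-[:FEEDS*1..]->(down:Table)
RETURN t.name AS table, count(DISTINCT down) AS downstream_count
ORDER BY downstream_count DESC
"""

# Isolated nodes: no upstream and no downstream, candidates for deletion.
ISOLATED_NODES = """
MATCH (t:Table)
WHERE NOT (t)-[:FEEDS]-()
RETURN t.name AS orphan
"""
```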

7. Lineage-driven business development

With the lineage built, the statistical results in hand, and the business pain points clarified, we can now use data lineage to drive the business forward better and faster.

The lineage-driven applications our team currently runs are as follows:

(1) Impact scope alerting:

Connect lineage to the scheduling tasks and monitor each lineage node's task; if a node's task fails, alerts are sent for all of its downstream nodes.

(2) Failure cause detection:

Likewise connecting lineage to the scheduling tasks: if a node's task fails, its direct upstream nodes are inspected to detect the cause of the failure.

(3) One-click recovery of a failed link:

Building on the previous application: once the root cause is located and fixed, the lineage system can rerun all of the data node's downstream scheduling tasks with a single click, truly making recovery a one-click operation (see the sketch below).
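A minimal sketch of that recovery step, assuming networkx and a hypothetical trigger_task() hook into the scheduler; downstream tasks rerun in topological order so parents always run before children:

```python
# A minimal sketch of one-click recovery, assuming networkx and a hypothetical
# trigger_task() scheduler hook; edges point upstream -> downstream.
import networkx as nx

g = nx.DiGraph([("ods_orders", "dwd_orders"), ("dwd_orders", "ads_order_report")])

def trigger_task(node: str) -> None:
    print(f"rerunning {node}")             # stand-in for a real scheduler call

def recover_downstream(graph: nx.DiGraph, fixed_node: str) -> None:
    """Rerun every task downstream of the repaired node, in dependency order."""
    affected = nx.descendants(graph, fixed_node)
    sub = graph.subgraph(affected)
    for node in nx.topological_sort(sub):  # parents always rerun before children
        trigger_task(node)

recover_downstream(g, "ods_orders")        # reruns dwd_orders, then ads_order_report
```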

(4) Supporting data removal:

So far, by detecting isolated nodes (nodes with no upstream or downstream), the team has archived 628 data tables, saving 13% of storage space.

(5) Data quality monitoring:

Ranking all nodes in the lineage by downstream node count accurately measures each table's impact scope, and data quality monitoring is then applied to the top-ranked tables.

(6) Data standards monitoring:

If the company has naming conventions for databases, tables, and fields, we can check every data node in the lineage against those conventions and flag non-compliant databases, tables, and fields for rectification.

Of course, this can also be done with metadata alone; listing it here is, admittedly, the blogger stretching a point.
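A minimal sketch of such a check, assuming a hypothetical convention that table names carry a layer prefix (ods_/dwd_/dws_/ads_):

```python
# A minimal sketch of naming-convention checks; the rule that table names must
# start with a layer prefix such as ods_/dwd_/dws_/ads_ is hypothetical.
import re

TABLE_RULE = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9_]+$")

tables = ["ods_orders", "dwd_orders", "tmp_orders_backup"]
violations = [t for t in tables if not TABLE_RULE.match(t)]
print(violations)  # ['tmp_orders_backup'] -> flagged for rectification
```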

(7) Data security audit:

The team assigns each database table a permission level based on weighted factors such as user rank, department, and operation behavior; the higher the permission level, the higher the table's security level.

The team monitors security levels along the entire data link via lineage: if a downstream node's security level turns out to be lower than its upstream node's, an alarm is raised and rectification is requested, preventing data breaches caused by inconsistent security levels.
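A minimal sketch of that check, assuming networkx with a numeric level attribute per node; all tables and levels are hypothetical:

```python
# A minimal sketch of the lineage-based security audit, assuming networkx with a
# numeric "level" attribute per node; the tables and levels are hypothetical.
import networkx as nx

g = nx.DiGraph()
g.add_node("ods_users", level=3)        # high-security upstream source
g.add_node("dwd_users", level=3)
g.add_node("ads_user_report", level=1)  # downstream with a lower level: a risk
g.add_edges_from([("ods_users", "dwd_users"), ("dwd_users", "ads_user_report")])

# Flag every edge where the downstream security level drops below the upstream's.
for up, down in g.edges:
    if g.nodes[down]["level"] < g.nodes[up]["level"]:
        print(f"ALERT: {down} (level {g.nodes[down]['level']}) "
              f"is below upstream {up} (level {g.nodes[up]['level']})")
```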

8. Evaluation Criteria for a Lineage System

While promoting lineage adoption, users often ask: how good is the lineage? Is scenario coverage comprehensive? Can it solve their pain points? Was it worth building?

I also kept wondering: with so many lineage solutions on the market, where does our self-built system's core advantage lie, and on what dimensions should a lineage system be judged? Our team therefore quantified the following three technical metrics:

1. Accuracy

Definition: a task's lineage is considered accurate when the task's actual inputs and outputs match its upstream and downstream in the lineage exactly, with nothing missing and nothing extra. Lineage accuracy is the proportion of tasks with accurate lineage out of the total number of tasks.

Accuracy is the core metric of data lineage. In impact-scope alerting, for example, missing lineage can mean an important task is never notified, leading to a production incident.

In practice, we use two methods to catch problematic lineage nodes as early as possible:

Manual verification: just as other systems are verified with constructed test cases, lineage accuracy can be verified the same way. In practice, we sample a portion of the tasks running online and manually check whether the parsed results are correct; a small scoring sketch follows below.

User feedback: verifying the accuracy of the full lineage is a long process, but scoped to one user's specific business scenario, the problem becomes much simpler. In practice, we work closely with selected business teams to verify lineage accuracy and fix problems.
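A minimal sketch of scoring accuracy over a manually verified sample; the task names and upstream sets are hypothetical:

```python
# A minimal sketch of scoring accuracy from manually verified samples. A task
# counts as accurate only when the parsed upstream set matches the manually
# checked truth exactly: nothing missing, nothing extra.
samples = [
    {"task": "build_dwd_orders", "parsed": {"ods_orders"}, "actual": {"ods_orders"}},
    {"task": "build_ads_report", "parsed": {"dwd_orders"},
     "actual": {"dwd_orders", "dim_date"}},  # missing one upstream: inaccurate
]

accurate = sum(1 for s in samples if s["parsed"] == s["actual"])
print(f"lineage accuracy: {accurate / len(samples):.0%}")  # 50%
```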

2. Coverage

Definition: a data asset counts as covered once it has been entered into the lineage system. Lineage coverage is the ratio of data assets covered by lineage to all data assets.

Coverage is a relatively coarse-grained metric. As a complement to accuracy, it tells users which data asset types and task types are currently supported, and how far each extends.

Internally, we track coverage for two purposes: first, for the data asset collections we care most about; second, to find the asset collections not yet covered in current business processes, to guide subsequent lineage improvements.

When coverage is low, the lineage system's applicability is necessarily incomplete. Watching coverage tells us how far the rollout has progressed and helps push data lineage adoption forward in an orderly way.

3. Timeliness

Definition: the end-to-end delay from the moment a data asset is added or a task is modified to the moment the new or changed lineage is recorded in the lineage system.

For some user scenarios, lineage timeliness is not especially important and is merely a bonus; other scenarios depend on it strongly, and requirements vary across task types.

For example, alerting and recovery across a failure's impact scope is a scenario with high real-time demands on lineage. If the lineage system only refreshes to a T-1 state on a schedule, serious business incidents can result.

To remove the timeliness bottleneck, business systems need to push task changes as near-real-time notifications, which the lineage system then applies.


Source: blog.csdn.net/xljlckjolksl/article/details/132257526