Data Governance: Data Lineage!

Data lineage is the core capability of metadata products, but it is also a technology that looks beautiful yet has a high barrier to use; anyone who has purchased a metadata product knows this. This article systematically covers the characteristics, value, uses, and collection methods of data lineage:

1. Characteristics: attribution, multiple sources, traceability, and hierarchy

2. Value: data value assessment, data quality assessment and data lifecycle management

3. Uses: compliance requirements, impact analysis and quality-problem analysis, data security and privacy, migration projects, and self-service analysis

4. Methods: automatic parsing, system tracking, machine learning methods, and manual collection

Looking at the collection methods: automatic parsing is largely unreliable, machine learning methods are still at the conceptual stage, manual collection suffers from poor timeliness and consistency, and system tracking depends heavily on standardized management and tool-integration capabilities. Even so, system tracking is the method I favor. To build data lineage, be specific and scenario-driven, and start small. Do not try to build complete lineage all at once; anything over-idealized ends up abandoned.


Conceptually, data lineage is easy to understand: over the full life cycle of data, relationships form between one piece of data and another. These relationships resemble human kinship, hence the name data lineage.

From a technical point of view, if data a is processed by an ETL job to produce data b, we say that a and b have a lineage relationship. Unlike human kinship, however, data lineage has some distinctive characteristics.

· Attribution

Data is owned by a specific organization or individual, and the owner has the right to use the data for marketing, risk control, and other purposes.

· Multiple sources

Here data lineage differs fundamentally from human kinship: the same data can have multiple sources (that is, multiple parents). It may be produced by several processing jobs, or by different processing methods or processing steps.

· Traceability

Data lineage reflects the full life cycle of data; the entire process from generation to disposal can be traced.

· Hierarchy

Data lineage is hierarchical. As in a traditional relational database, the user is the top level, followed by databases, tables, and fields, from top to bottom: one user owns multiple databases, one database stores multiple tables, and one table contains multiple fields. Together these levels combine into a complete lineage hierarchy.
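The user → database → table → field hierarchy can be sketched as a simple tree of lineage nodes. This is a minimal illustration; the class name, level names, and structure are assumptions for this article, not any metadata product's actual model.

```python
# Minimal sketch of a hierarchical lineage node: user -> database -> table -> field.
# Names and structure are illustrative assumptions, not a real product's model.
class LineageNode:
    LEVELS = ["user", "database", "table", "field"]

    def __init__(self, level, name, parent=None):
        assert level in self.LEVELS
        self.level, self.name, self.parent = level, name, parent
        self.children = []
        if parent is not None:
            # A child must sit exactly one level below its parent.
            assert self.LEVELS.index(level) == self.LEVELS.index(parent.level) + 1
            parent.children.append(self)

    def qualified_name(self):
        """Dotted path from the owning user down to this node."""
        parts, node = [], self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return ".".join(reversed(parts))

user = LineageNode("user", "alice")
db = LineageNode("database", "school", user)
table = LineageNode("table", "student_info", db)
field = LineageNode("field", "student_id", table)
print(field.qualified_name())  # alice.school.student_info.student_id
```

The level check in the constructor enforces the top-down containment the text describes: a field can only hang off a table, a table off a database, and so on.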

The ER diagram of the back-end database of a school's student management system is shown below. Fields such as student ID, name, gender, date of birth, grade, and class make up the student information table; the student information table, teacher information table, and course selection table are linked through one or more shared fields to form the database behind the entire student management system.

[Figure: ER diagram of the student management system's back-end database]

Both structured and unstructured data have lineage. The relationships may be simple and direct or intricate, and they can be traced with systematic methods.

Take a bank's financial indicators as an example. Net interest income is interest income minus interest expense. Interest income can be divided into customer-business interest income, capital-markets-business interest income, and other interest income; customer-business interest income further breaks down into credit-business interest income and other business interest income; and credit-business interest income can be subdivided again by business line and business segment.

Such decomposition can be traced from financial indicators all the way back to original business data, such as the customer weighted-average loan interest rate and the new-loan balance. If a data quality problem is found in the net interest income indicator, the root cause can be located quickly by walking this lineage upstream, as in the figure below.

[Figure: lineage decomposition of the net interest income indicator]
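The upstream trace just described can be sketched as a walk over a lineage graph. The edge map below is an illustrative assumption reconstructed from the indicator breakdown in the text, not a real bank's metadata.

```python
# Upstream lineage for the net-interest-income example (illustrative edges only).
# Maps each indicator to the data it is derived from.
UPSTREAM = {
    "net_interest_income": ["interest_income", "interest_expense"],
    "interest_income": ["customer_business_interest_income",
                        "capital_market_interest_income",
                        "other_interest_income"],
    "customer_business_interest_income": ["credit_business_interest_income",
                                          "other_business_interest_income"],
    "credit_business_interest_income": ["weighted_avg_loan_rate",
                                        "new_loan_balance"],
}

def trace_upstream(indicator):
    """Return every source reachable upstream of `indicator` (depth-first)."""
    sources, stack = [], [indicator]
    while stack:
        node = stack.pop()
        for src in UPSTREAM.get(node, []):
            if src not in sources:
                sources.append(src)
                stack.append(src)
    return sources

# Root-cause candidates for a quality problem in net interest income are the
# leaf sources, i.e. those with no further upstream edges:
roots = [s for s in trace_upstream("net_interest_income") if s not in UPSTREAM]
print(roots)
```

For a quality investigation, the leaf sources (the raw business data) are where the root cause is usually checked first.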

Lineage tracing applies not only to indicator calculations but also to data sets. A data field, a data table, or an entire database may each have lineage relationships with other data sets. Analyzing these relationships helps improve data quality, and it also supports data value assessment, data quality assessment, and subsequent data life-cycle management.

From the perspective of data value assessment, sorting out lineage makes it easy to identify a data set's owners and consumers. Put simply, when a data set has few owners and many consumers (data demanders), its value is higher. Within a data flow, sources that have greater influence on the final target data are relatively more valuable. Likewise, sources that are updated and changed frequently generally play a larger role in computing and aggregating the target data, so they can be judged to be of high value.
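The "few owners, many consumers" heuristic can be written as a toy scoring function. The formula and weights below are assumptions made for illustration, not a standard metric.

```python
def value_score(n_owners, n_consumers, updates_per_month=0):
    """Toy data-value heuristic: more consumers and frequent updates raise the
    score; more owners dilute it. The weights are illustrative assumptions."""
    if n_owners <= 0:
        raise ValueError("a data set needs at least one owner")
    return (n_consumers + 0.5 * updates_per_month) / n_owners

# A table with 1 owner, 20 consumers, updated daily scores higher than
# one with 5 owners and only 2 consumers:
print(value_score(1, 20, 30) > value_score(5, 2, 0))  # True
```

In practice such a score would be one input among many; the point is only that lineage exposes the owner and consumer counts the heuristic needs.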

From the perspective of data quality assessment, knowing the sources and processing methods clearly makes it possible to assess data quality at each node.

From the perspective of data life-cycle management, lineage helps us judge where data is in its life cycle and serves as a reference for archiving and destruction decisions.

Given the importance and characteristics of data lineage, lineage analysis generally focuses on the relationships between data at the application (system) level, the program level, and the field level. Most commonly, data is exchanged and transmitted between systems through interfaces.

For example, in the figure below, data from the bank's business systems is transferred and distributed by a unified data exchange platform to both traditional relational databases and a non-relational big data platform, involving a large amount of data processing and data exchange work:
[Figure: data exchange between banking systems via a unified exchange platform]

When analyzing the lineage within such a landscape, the main considerations are the following:

1. Comprehensiveness

As shown above, data processing is essentially a process in which programs transfer, derive, and archive data. Even archived data may affect system results or flow to other systems by other routes. To keep data-flow tracking coherent, the entire set of systems must be taken as the object of analysis.

2. Static analysis method

The advantage of this method is that it avoids human factors: accuracy does not depend on how detailed the documentation, test cases, or sampled data are. All involved paths are statically enumerated, giving an objective reflection of the data flow.

3. Propagation ("contact infection") analysis

By screening the program statements related to data transfer and mapping, the key information can be extracted for in-depth analysis.

4. Logic timing analysis method

To avoid interference from redundant information, indirect transfers and mappings (those that pass through program-internal intermediate variables and have no direct bearing on database, file, or communication-interface fields) are converted, following the program's processing flow, into direct transfers and mappings between database, file, and communication-interface fields.

5. Timeliness

To keep field-level association information available and current, query results must stay synchronized with updates to that association information, achieving "what you see is what you get" across the whole system.

Generally speaking, the uses of data lineage fall into the following areas:

1. Compliance requirements: this comes from the regulatory authorities. For compliance, every node and source in a data flow needs to be supervised.

2. Impact analysis and quality-problem analysis: this is the core need of the data development department. As data applications multiply, data flow chains grow longer and longer. If core business logic changes at a source, the downstream analytical applications must stay in sync; otherwise every dependent data service may break.

3. Data security and privacy: this is the requirement of the data compliance department, which must know which data needs to be desensitized and maintain control over every domain across the full circulation of the data.

4. Migration projects: these arise when an old system is decommissioned and a new one takes over. Without a data-flow mapping table, the inventory work takes a great deal of time, and the completeness and correctness of the migration are hard to guarantee.

5. Self-service analysis: for an analysis team to judge whether data is trustworthy, its provenance is a key basis for that trust.

Building and maintaining a data lineage system is heavy systems engineering. In my view it is one of the quicksand areas of data governance: one careless step and you fall into the pit, especially if the person in charge is a technical perfectionist. Many factors must be weighed in lineage work. To minimize the risk of failure, identify who will consume the lineage; set business and technical priorities; decide on the level of detail, coverage, and change frequency; and account for staff turnover, organizational structure, and the technical architecture. Then formulate the strategy that fits your situation best.

Data lineage is collected in the following main ways:

1. Automatic parsing

Automatic parsing is currently the main collection method. It works by parsing SQL statements, stored procedures, ETL jobs, and similar artifacts. Because code and application environments are complex, the experience of international vendors is that automatic parsing can cover roughly 70-95% of enterprise data, but 100% is currently unattainable, so do not chase extremely high coverage.
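As a toy illustration of what automatic parsing does, the sketch below extracts table-level lineage from a simple `INSERT ... SELECT` statement with a regular expression. Real lineage parsers use a full SQL grammar and handle stored procedures, dialects, subqueries, and aliases; this regex only covers the simplest case.

```python
import re

def table_lineage(sql):
    """Extract (target, [sources]) from a simple INSERT ... SELECT statement.
    A toy sketch: real parsers use a full SQL grammar, not a regex."""
    target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    if not target:
        return None
    return target.group(1), sources

sql = """
INSERT INTO dw.net_interest_income
SELECT a.acct_id, a.interest - b.expense
FROM ods.interest_income a
JOIN ods.interest_expense b ON a.acct_id = b.acct_id
"""
print(table_lineage(sql))
```

The gap between this toy and production parsers is exactly why the 70-95% coverage ceiling mentioned above exists: the remaining cases are the ones a simple pattern cannot see.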

2. System tracking

In this method, the tool that performs the data processing is itself responsible for emitting the data mappings as part of the processing flow. The great advantage is that collection is accurate, timely, and fine-grained. The limitation is that not every tool can be integrated, so it generally presupposes a unified processing platform; Informatica, for example, can manage the full lineage of data processed on its own platform.
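System tracking means the processing tool emits a lineage record with each run. Below is a minimal sketch of such an event; the field names loosely echo the style of the open-source OpenLineage project, but the structure here is an illustrative assumption, not the actual specification.

```python
import json
from datetime import datetime, timezone

def lineage_event(job, inputs, outputs):
    """Build a lineage record the way a pipeline tool might emit one per run.
    Field names are illustrative; real tools follow a spec such as OpenLineage."""
    return {
        "job": job,
        "inputs": inputs,
        "outputs": outputs,
        "run_time": datetime.now(timezone.utc).isoformat(),
    }

event = lineage_event(
    job="load_net_interest_income",
    inputs=["ods.interest_income", "ods.interest_expense"],
    outputs=["dw.net_interest_income"],
)
print(json.dumps(event, indent=2))
```

Because the tool emits the record at execution time, the lineage is accurate and current by construction, which is exactly the advantage described above.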

3. Machine learning methods

This method infers dependencies between data sets by computing the similarity of their data. Its advantage is that it does not depend on any particular tool or business process; its drawback is that the results' accuracy must be confirmed manually. According to the work summary of an Alibaba algorithm engineer, such analysis can generally discover 3-8 data relationships.
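A toy version of similarity-based inference: compare the value sets of two columns with Jaccard similarity and flag pairs above a threshold as candidate lineage links. The threshold and the overlap-only approach are illustrative assumptions; real systems also use column names, types, statistics, and update patterns, and still require manual confirmation.

```python
def jaccard(a, b):
    """Jaccard similarity of two column value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_lineage(columns, threshold=0.8):
    """Return column pairs whose value overlap suggests a lineage link.
    The threshold is an illustrative assumption, not a tuned value."""
    names = list(columns)
    return [(x, y) for i, x in enumerate(names) for y in names[i + 1:]
            if jaccard(columns[x], columns[y]) >= threshold]

cols = {
    "ods.loans.customer_id": [1, 2, 3, 4, 5],
    "dw.loan_fact.cust_id":  [1, 2, 3, 4, 5],
    "dw.branch.branch_id":   [901, 902],
}
print(candidate_lineage(cols))  # [('ods.loans.customer_id', 'dw.loan_fact.cust_id')]
```

Note that value overlap alone cannot tell direction (which column is the source), which is one reason the results need human review.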

4. Manual collection

Across a whole project, generally about 5% of lineage still needs to be collected manually.


Most current data lineage is compiled from a technical standpoint and mainly serves technical staff. As data services move toward the business front line, serving business analysts and the CDO with business-oriented lineage, some products already map technical metadata to business metadata through semantic analysis and publish and share lineage in the form of business processes. Supporting business decision-making in this way is one direction of future development.



Origin: blog.csdn.net/weixin_39971741/article/details/131918281