Volcano Engine DataLeap: data lineage construction in practice


DataLeap is a big data R&D and governance suite under VeDI, the Volcano Engine digital intelligence platform. It helps users quickly build a complete data middle platform covering data integration, development, operations, governance, assets, and security, reducing development and data maintenance costs, mining data value, and providing data support for business decisions.

Data lineage is a foundational capability that helps users find data, understand data, and make data valuable. Focusing on lineage storage and lineage export, this article shares the design and optimization of the data lineage model and introduces the challenges ByteDance encountered while building data lineage, the technical solutions, and concrete use cases, covering four parts: the data lineage model, data lineage optimizations, data lineage use cases, and future directions. Most of the lineage capabilities and practices introduced here are already available externally through Volcano Engine DataLeap.

▌Experience 1: Layered architecture of the data lineage model

1. Challenge

First, let us introduce the challenges ByteDance encountered with data lineage internally.

As the company's business expanded, the user base grew, and data warehouse construction matured, the types and volume of metadata grew non-linearly, and several problems surfaced during this period.

First, scalability. Good scalability ensures fast onboarding and iteration when new metadata sources appear; poor scalability forces constant refactoring to keep up with business changes, with significant impact on the business.

Second, performance. The insertion and update efficiency of the model directly affects data import and export, and therefore the user-facing experience, so the efficiency of these paths must be guaranteed.

Third, timeliness. Many application scenarios are particularly sensitive to accuracy. If lineage data lags behind, the lineage is effectively inaccurate, which affects the business.

Finally, empowering the business. Technology serves the business: business growth drives technical upgrades, and technical innovation in turn promotes business development. Within ByteDance, we weigh business needs against technical cost based on business characteristics, and make data model decisions accordingly. In short, there is no perfect data model, only the lineage solution best suited to the company's own business at its current stage.

2. Data lineage model-presentation layer

ByteDance has many types of metadata internally, including the traditional online and offline data warehouse Hive, the OLAP analysis engine ClickHouse, and real-time metadata such as Kafka, ES, and Redis. The tables/topics corresponding to this metadata are maintained uniformly on the metadata platform, and the lineage presentation layer currently takes these data assets as its main perspective.

As shown in the figure below, the central data asset includes information such as regular fields and partition fields, along with the upstream and downstream assets of that central asset. The edges connecting assets represent production relationships: a task reads the upstream assets and produces the downstream assets.

3. Data lineage model-abstraction layer

Next, we introduce how Volcano Engine DataLeap designs the abstraction layer.

The abstraction layer is the data model of the entire data lineage system. It contains two types of nodes: asset nodes and task nodes.

In the figure, asset nodes are drawn as circles and task nodes as diamonds. A concrete example:

  • A FlinkSQL task consumes a Kafka topic and writes to a Hive table. The Kafka topic and the Hive table are table asset nodes, and the FlinkSQL consumption task is the task node in between.
  • A Kafka topic may define its own schema with multiple fields, say a, b, and c. A FlinkSQL task such as "insert into hiveTable select a, b, c from kafkaTopic" then establishes field-level lineage between fields a, b, c of the topic and the corresponding fields of the Hive table.
  • For field-level lineage we create subtask nodes connecting the field nodes involved. Each subtask node is linked to its parent task node by an affiliation edge, and each field node is likewise linked to its table asset node by an affiliation edge. The task node itself is linked to asset nodes by consumption and production edges.

The above is the lineage data model at the abstraction layer.
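As a rough illustration, the two node types and their edges can be sketched as follows; the class and relation names here are hypothetical, not DataLeap's actual schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AssetNode:
    qualified_name: str          # e.g. "hive.db.table" or "kafka.cluster.topic"
    asset_type: str              # "hive_table", "kafka_topic", ...

@dataclass(frozen=True)
class TaskNode:
    task_id: str                 # e.g. a FlinkSQL job id
    task_type: str               # "flink_sql", "spark_jar", ...

@dataclass
class LineageGraph:
    edges: set = field(default_factory=set)  # (src, dst, relation) triples

    def record_task(self, task, inputs, outputs):
        # One task reads upstream assets and produces downstream assets.
        for a in inputs:
            self.edges.add((a, task, "consume"))
        for a in outputs:
            self.edges.add((task, a, "produce"))

    def downstream_assets(self, asset):
        # Follow asset -> task consume edges, then task -> asset produce edges.
        tasks = {dst for (src, dst, r) in self.edges if src == asset and r == "consume"}
        return {dst for (src, dst, r) in self.edges if src in tasks and r == "produce"}

topic = AssetNode("kafka.cluster.events", "kafka_topic")
table = AssetNode("hive.ods.events", "hive_table")
job = TaskNode("flink_job_1", "flink_sql")

g = LineageGraph()
g.record_task(job, inputs=[topic], outputs=[table])
assert g.downstream_assets(topic) == {table}
```

Field-level lineage would add field nodes and subtask nodes in the same way, with affiliation edges back to their table asset and task nodes.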

This design has the following benefits:

First, the task-asset abstraction generalizes over the production platform and the various task platforms. When onboarding new metadata or new task types, we only need to extend the current abstract asset and task nodes to bring the new task link's lineage into storage. The model also makes it easy to update and delete lineage links, maintaining timeliness.

Second, onboarding the various lineage links inside ByteDance is still difficult, and this design reduces the development cost. To update lineage, we only need to update the central task node, and correspondingly update or delete the edges of its subtask nodes, which completes the insertion and update of lineage information.
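The update pattern described here, treating the central task node as the unit of update and replacing all of its edges in one operation, can be sketched like this (data structures are illustrative only):

```python
# In-memory stand-in for the graph store: task_id -> its current edge sets.
lineage = {}   # task_id -> {"inputs": set, "outputs": set}

def upsert_task_lineage(task_id, inputs, outputs):
    # Updating is a single operation on the central task node:
    # drop its old edges and write the new ones in one step.
    lineage[task_id] = {"inputs": set(inputs), "outputs": set(outputs)}

def delete_task_lineage(task_id):
    # Deleting the task node removes all of its edges with it.
    lineage.pop(task_id, None)

upsert_task_lineage("job_1", ["kafka.topic_a"], ["hive.table_b"])
upsert_task_lineage("job_1", ["kafka.topic_a", "kafka.topic_c"], ["hive.table_b"])
assert lineage["job_1"]["inputs"] == {"kafka.topic_a", "kafka.topic_c"}
delete_task_lineage("job_1")
assert "job_1" not in lineage
```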

4. Data lineage model-implementation layer

At the implementation layer, Volcano Engine DataLeap is mainly built on Apache Atlas. Apache Atlas is itself a data governance product; it predefines some metadata types, and its type system has good extensibility. On top of Atlas's own DataSet and Process metadata definitions, we introduced ByteDance-specific business metadata attributes and subtask definitions, and store the task-related metadata there.

Atlas itself also supports lineage queries: through the interfaces Apache Atlas exposes, a node's lineage can be translated into the corresponding edges on the graph to answer lineage queries.

5. Data lineage model-storage layer

At the storage layer, we currently rely mainly on Atlas's native graph database, JanusGraph, which supports HBase underneath. We store each edge as a property of the asset nodes on both ends, in an independent cell under the corresponding RowKey.

We have also made improvements to storage, such as adopting ByteDance's internally developed key-value store with storage-compute separation. For lightweight deployments in standalone environments, depending on performance, cost, and deployment complexity, we can switch the storage to an OLTP database such as MySQL.

That concludes the design of the data lineage model. With this model we reduce the development cost of onboarding new lineage links, and updating and deleting lineage is also convenient.

▌Experience 2: Three data lineage optimization directions

The second part introduces typical data lineage optimizations in Volcano Engine DataLeap, covering real-time lineage updates, lineage query optimization, and open export of lineage data.

1. Real-time lineage update optimization

First, real-time lineage updates. Inside ByteDance, lineage is currently updated through both a T+1 link and a real-time link. Many internal scenarios have particularly high timeliness requirements; if lineage updates lag, lineage accuracy suffers and business usage is affected.

The architecture has supported T+1 import since the beginning, but its timeliness is bounded by the daily cycle:

  • The lineage job periodically pulls the configuration of all running tasks and calls the platform's API to fetch each task's configuration or SQL.
  • For SQL-type tasks, it calls the parsing capabilities of a separate parsing engine service to extract lineage information.
  • It then matches the result against the assets registered on the metadata platform, constructs the task node's upstream and downstream, and writes the edges between the task node and the table asset nodes into the graph database.
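The three T+1 steps above can be sketched as a small pipeline; the platform API, parsing service, and metadata catalog are stubbed as plain callables here and are not DataLeap's real interfaces:

```python
def daily_lineage_import(list_tasks, parse_sql, resolve_asset):
    """Return {task_id: (input_assets, output_assets)} for one daily run."""
    edges = {}
    for task in list_tasks():                       # 1. pull running tasks + config
        if task["type"] == "sql":                   # 2. SQL tasks go to the parser
            inputs, outputs = parse_sql(task["sql"])
        else:
            inputs, outputs = task["inputs"], task["outputs"]
        edges[task["id"]] = (                       # 3. match registered assets
            [resolve_asset(n) for n in inputs],
            [resolve_asset(n) for n in outputs],
        )
    return edges

tasks = [{"id": "t1", "type": "sql", "sql": "insert into b select * from a"}]
parse = lambda sql: (["a"], ["b"])                  # stand-in for the parsing service
resolve = lambda name: f"asset:{name}"              # stand-in for catalog matching
assert daily_lineage_import(lambda: tasks, parse, resolve) == {
    "t1": (["asset:a"], ["asset:b"])
}
```

A real run would end by writing the returned edges into the graph database instead of returning them.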

When updating in real time, we have two options:

Option 1: collect on the engine side. When a task runs, the lineage produced after the execution engine builds its DAG is emitted through a hook.

  • Advantages: engine-side lineage collection is relatively independent; engines do not affect one another when collecting lineage.

  • Disadvantages:

    • Each engine needs its own lineage-collection hook. A problem small and medium enterprises may face is that the same engine can have multiple versions running online, so the adaptation cost is high: each version must be adapted separately.
    • Hooks are also somewhat intrusive and add overhead to the task's own execution.

Option 2: send task change messages from the task development platform. When a task's lifecycle changes, a hook registers the status change via an API call, or sends it to MQ for decoupling. After receiving the notification, the lineage service actively calls the parsing service to update the task's lineage.

  • Advantages: good extensibility, not constrained by the engines. To onboard a new engine in the future, we only need the task platform to create the corresponding task type and emit its change messages; the lineage service then receives the update notification and refreshes the lineage.
  • Disadvantages: the lineage parsing service requires some rework, and messages from different tasks may affect one another.

Comparing the two, we adopted the second option and introduced MQ to further decouple the task platform from the lineage platform. This sacrifices some latency but makes the whole link more reliable, and ultimately reduced overall lineage latency from days to minutes.
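A minimal sketch of Option 2 with MQ decoupling, using an in-process queue as a stand-in for the real message queue (all names are illustrative):

```python
import json
import queue

task_change_mq = queue.Queue()          # stand-in for the real MQ

def publish_task_change(task_id, status):
    # The task development platform emits a message on every lifecycle change.
    task_change_mq.put(json.dumps({"task_id": task_id, "status": status}))

def lineage_consumer(update_lineage, delete_lineage):
    # The lineage service drains the queue; each message triggers an
    # active re-parse (update) or removal of the task's lineage.
    while not task_change_mq.empty():
        msg = json.loads(task_change_mq.get())
        if msg["status"] in ("created", "updated", "online"):
            update_lineage(msg["task_id"])
        elif msg["status"] in ("deleted", "offline"):
            delete_lineage(msg["task_id"])

updated, deleted = [], []
publish_task_change("job_1", "online")
publish_task_change("job_2", "offline")
lineage_consumer(updated.append, deleted.append)
assert updated == ["job_1"] and deleted == ["job_2"]
```

The MQ in the middle is what keeps the task platform and the lineage platform loosely coupled: neither side calls the other directly.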

The above is our optimization of lineage timeliness.

2. Lineage query optimization

The second optimization is querying. ByteDance's lineage queries currently rely on Apache Atlas. A very common scenario is the multi-node query: during impact analysis we often query the lineage of all fields of a table, which translates into querying the upstream and downstream of many nodes at once, so query efficiency must be addressed.

There are two basic solutions:

One is to wrap it at the application layer: add an interface on top of the exposed Apache Atlas lineage service that loops over single-node queries. The change is small, but performance does not actually improve, and the implementation is rather brute-force.

The other is to modify the Atlas lineage service's calls into the graph database. Atlas uses JanusGraph as the underlying implementation and provides a partial abstraction over it, but only exposes single-node queries with no batch method. We therefore need to adapt to JanusGraph's batch query interface to achieve the speedup.

So we added a batch query method at the graph database's operation entry point, allowing lineage nodes to be queried in batches for better performance. In addition, after Atlas retrieves a lineage node it must map it to a concrete entity to fetch some of its attributes; here we also added asynchronous batch operations. After these optimizations, efficiency improves significantly when querying table asset nodes, or tables and columns that are heavily referenced.
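The difference between the loop-based query and the batch query can be sketched as follows; the adjacency dict stands in for JanusGraph, and a real adaptation would issue a single multi-vertex traversal rather than N separate ones:

```python
# Toy field-level lineage store: field -> set of direct downstream fields.
graph = {
    "table.a.col1": {"table.b.col1"},
    "table.a.col2": {"table.b.col2"},
    "table.a.col3": {"table.b.col2"},
}

def query_one(node):
    # Original path: one graph round trip per node.
    return graph.get(node, set())

def query_downstream_naive(nodes):
    return {n: query_one(n) for n in nodes}       # N round trips

def query_downstream_batch(nodes):
    # Batched path: one request answers all nodes at once.
    wanted = set(nodes)
    result = {n: set() for n in nodes}
    for src, dsts in graph.items():
        if src in wanted:
            result[src] |= dsts
    return result

cols = ["table.a.col1", "table.a.col2", "table.a.col3"]
assert query_downstream_naive(cols) == query_downstream_batch(cols)
```

Both return the same answer; the win comes entirely from collapsing N round trips into one, which is what the JanusGraph batch interface enables.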

3. Open export of lineage data

The third optimization is offering multiple ways to export lineage. Besides visual lineage queries on the page, we have added many ways to consume lineage: downloading to Excel, exporting lineage data to a data warehouse table, calling open APIs on the service platform, or subscribing to a lineage-change topic to monitor changes directly. Downstream users can choose how to consume lineage data based on their own development scenarios and their accuracy and coverage requirements.

▌Experience 3: Four data lineage use case domains

The third part introduces concrete use cases of data lineage and how it is used inside ByteDance. The typical domains are: assets, development, governance, and security.

1. Data lineage use case – asset domain

First, in the asset domain, data lineage is mainly used to compute asset popularity. Some assets are frequently consumed and widely referenced; being referenced by many downstream consumers reflects an asset's authority, and that authority needs a quantitative measure, hence the concept of "asset popularity". Asset popularity is implemented with reference to the page ranking algorithm PageRank. We also provide a popularity value defined from the asset's downstream dependencies: the higher an asset's reference popularity, the more the asset should be trusted and the more reliable its data.
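Popularity based on PageRank can be sketched as below: each consumer "votes" for the assets it reads, analogous to pages linking to pages. This is a toy power-iteration implementation over a made-up lineage graph, not the production algorithm:

```python
def asset_popularity(reads, damping=0.85, iters=50):
    """reads: {consumer: [upstream assets it reads]}; returns {asset: score}."""
    nodes = set(reads) | {u for ups in reads.values() for u in ups}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in nodes}
        # Each consumer splits its rank among the assets it depends on.
        for consumer, ups in reads.items():
            if ups:
                share = damping * rank[consumer] / len(ups)
                for u in ups:
                    nxt[u] += share
        # Nodes with no outgoing reads distribute their rank evenly.
        dangling = damping * sum(rank[v] for v in nodes if not reads.get(v)) / n
        rank = {v: nxt[v] + dangling for v in nodes}
    return rank

# Two reports read dwd.core, one reads dwd.misc: core should rank higher.
reads = {"rpt1": ["dwd.core"], "rpt2": ["dwd.core"], "rpt3": ["dwd.misc"]}
scores = asset_popularity(reads)
assert scores["dwd.core"] > scores["dwd.misc"]
```

The more downstream consumers depend on an asset, the higher its score, which matches the intuition that widely referenced assets are more authoritative.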

Lineage also helps users understand data. For example, when a user looks up a data asset on the metadata platform or the lineage platform, perhaps to start a development job or troubleshoot a problem, they first need to find the asset. Without understanding how the data was produced, they cannot understand its past and future, a classic philosophical question: where did this table come from, and what exactly does it mean? Data lineage answers this by surfacing a table's upstream and downstream information.

2. Data lineage use case – development domain

The second use case domain is development, with two applications: impact analysis and attribution analysis.

  (1) Impact analysis

Impact analysis is ex-ante: when a table asset is about to change, the impact can be perceived in advance. When the owner of an upstream asset modifies its production task, they need to look at the downstream assets through lineage to judge the impact of the change, and then, depending on the compatibility of the change and the importance of the link, send notifications and take other actions. Otherwise, missing notifications can cause serious production incidents.

  (2) Attribution analysis

Attribution analysis is post-hoc. For example, when a table produced by some task has a problem, we can query upstream along the lineage, step by step, to find the changed task or asset nodes and locate the root cause. After the problem is located, we repair the data, and lineage tells us the dependencies between tasks and tables. For an offline warehouse we may need to rerun the output of a certain partition: we delineate the scope using lineage and only rerun the affected downstream tasks, avoiding unnecessary waste of resources.
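Delineating the backfill scope is essentially a downstream reachability query over the lineage graph; a minimal BFS sketch (the graph shape is illustrative):

```python
from collections import deque

def affected_downstream(edges, start):
    """edges: {node: [direct downstream nodes]}; returns all reachable nodes."""
    seen, q = set(), deque([start])
    while q:
        node = q.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                q.append(nxt)
    return seen

edges = {
    "ods.logs": ["task.clean"],
    "task.clean": ["dwd.logs"],
    "dwd.logs": ["task.agg"],
    "task.agg": ["dws.daily"],
    "ods.other": ["task.other"],          # unrelated branch, not rerun
}
scope = affected_downstream(edges, "ods.logs")
assert "task.agg" in scope and "task.other" not in scope
```

Only the tasks inside the returned scope need to be rerun after the partition is repaired.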

3. Data lineage use case – governance domain

In the governance domain, lineage also has typical scenarios inside ByteDance: link status tracking and data warehouse governance.

  (1) Link status tracking

For example, during important festivals or events, we need to pick out in advance the tasks that require priority guarantees. Using lineage, we sort out the backbone of the link, i.e. the core link, and then apply targeted governance and guarantees, such as signing SLAs.

  (2) Data warehouse governance

Lineage also assists data warehouse construction, for example standardized governance: cleaning up unreasonable references across warehouse layers, non-standard layering, redundant tables, and so on. For instance, two tables derived from the same upstream table but placed in different layers are redundant, and lineage helps identify and clean them up.

4. Data lineage use case – security domain

Security issues are most common in multinational companies or international products, where each country and region has its own security policies. For security compliance checks, each asset has a security level governed by certain rules; for example, we require that a downstream asset's security level be no lower than its upstream asset's level, otherwise permissions may leak or other security issues arise. Based on lineage, we can scan the downstream assets covered by these rules, configure the corresponding scanning rules, run security compliance checks, and apply the necessary governance.
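The level rule described here, that a downstream asset's security level must be no lower than any upstream asset's, can be sketched as a simple scan over lineage edges (levels and edges are illustrative):

```python
def scan_level_violations(edges, level):
    """edges: (upstream, downstream) pairs; level: {asset: int}, higher = stricter.
    Returns the pairs where the downstream level drops below the upstream level."""
    return [
        (up, down) for up, down in edges
        if level.get(down, 0) < level.get(up, 0)
    ]

level = {"dwd.user_pii": 3, "dws.user_agg": 3, "rpt.public": 1}
edges = [("dwd.user_pii", "dws.user_agg"), ("dws.user_agg", "rpt.public")]
violations = scan_level_violations(edges, level)
assert violations == [("dws.user_agg", "rpt.public")]
```

Each reported pair marks a point where protected data flows into a less protected asset and governance action is needed.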

Lineage is also used for label propagation, which can be automated along lineage links. For example, when applying security labels to assets, manual labeling is cumbersome and requires knowledge of the link; with lineage information the labeling can be automated, e.g. by configuring rules that define the scene, node, and termination conditions of label propagation.

These are some typical use cases of data lineage inside ByteDance, and we continue to explore more scenarios.

Scenarios can be divided by their requirements on lineage quality: along the two axes of lineage coverage and lineage accuracy, they fall into four quadrants. For example, one category needs full-link coverage with extremely high accuracy, such as the two development-domain use cases, where lineage lag would seriously mislead decisions; these place the highest demands on lineage quality.

Lineage construction is also staged. Based on the business scenarios to be supported and their priorities, we can formulate the lineage construction plan and determine the rhythm and direction of lineage iteration.

▌Future Outlook

1. Data lineage technology trends

In the industry, lineage development trends mainly focus on the following points:

Universal lineage parsing

Lineage is a core capability of the metadata platform. The platform often onboards diverse metadata, and that business metadata depends on different lineage parsing capabilities. Today's parsing usually relies on support from each engine team, but in broader scenarios a comprehensive solution with more universal parsing is needed, so in the future we will provide a standard SQL parsing engine to achieve universal parsing.

Non-intrusive collection for non-SQL lineage

Besides parsable SQL and configurable tasks, code-type tasks such as JAR tasks are also common. Today, JAR task lineage is collected from embedded reporting information or upstream/downstream information entered by users. In the future, non-intrusive collection for non-SQL lineage will emerge, e.g. for Flink or Spark JAR tasks, capturing lineage while the task runs and enriching the platform's lineage data.

Temporal lineage

Temporal lineage is also under consideration at ByteDance. The current lineage graph database is essentially a snapshot of the current lineage topology, but lineage changes over time: a user modifies a task, an online task changes, or a table schema changes and the production task is modified accordingly. Introducing the time dimension makes it easy to trace task changes and supports before-and-after impact analysis. How to introduce temporal lineage into the graph database is therefore another future trend.

2. Application trends of data lineage

Standardization

As mentioned earlier, many application scenarios obtain the underlying capability through interfaces, and consuming data from those interfaces calls for application standardization. Standardized applications can be ported to more businesses and make lineage data analysis more helpful.

End-to-end lineage

Another application trend is end-to-end lineage. The platform currently mainly onboards asset nodes; end-to-end lineage extends further upstream, e.g. to data collected by apps and the web, and further downstream, to reports and the final nodes behind APIs. This information is currently missing from lineage collection, and connecting lineage end to end will be one of the future application trends.

3. Full-link lineage capabilities on the cloud

Inside ByteDance, lineage capabilities will move to the cloud, which involves diverse data types. One development direction is therefore to uniformly onboard heterogeneous data types and allow cloud users to onboard new lineage types themselves.

Once data applications are standardized, lineage applications can also be offered to cloud users, who can in turn join in developing them. Finally, promoting the data lineage model as a standard will foster a better ecosystem of lineage applications and services.

Most of the data lineage capabilities and practices introduced in this article are already available externally through Volcano Engine DataLeap; you are welcome to try them out.


Origin my.oschina.net/u/5588928/blog/10107436