Kangaroo Cloud Data Stack's DataOps data productivity practice: automating and standardizing the data production process

While helping enterprises with their digital transformation, the Kangaroo Cloud product team found that many of them face the same problems in the data production chain: data teams focus on the rapid short-term delivery of business requirements and lack a top-down data production management system; data standards, data production processes, and R&D specifications are incomplete or inconsistent at every level; many steps rely on manual operation, so teamwork is inefficient, business requirements are delivered slowly, and a large amount of data construction is duplicated; and the "develop first, govern later" model keeps piling up historical technical debt.

Enterprises at the forefront of digital transformation are actively looking for ways to improve data production efficiency, and DataOps has emerged, in both theory and practice, as a mature set of solutions to these problems.

As a leading provider of digital infrastructure software and application services in China, Kangaroo Cloud has delivered data production efficiency solutions to thousands of customers over more than seven years of R&D on Data Stack. Along the way it has continuously integrated DataOps concepts into its products, helping more and more enterprises complete their digital transformation and upgrade.

This article shares Data Stack's agile, high-quality data productivity practice based on DataOps, and we hope you find it useful.

Basic concepts of DataOps

If the rise of the data middle platform represents enterprises' transformation from process-driven to data-driven, and from digitalization to intelligence, then DataOps is an excellent concept, or methodology, for realizing the data middle platform.

The concept of DataOps was proposed by Lenny Liebmann as early as 2014. In 2018, DataOps was formally included in Gartner's Hype Cycle for Data Management, marking the industry's official acceptance and promotion of DataOps.

At this year's Data Asset Management Conference, the China Academy of Information and Communications Technology (CAICT) and the Big Data Technology Standards Promotion Committee defined DataOps (the integration of data development and operations) as a best practice covering the full data life cycle, with value maximization as its goal. By restructuring the organization, processes, and tools of data production within the enterprise, and by combining three core technical capabilities (R&D management, delivery management, and data operations and maintenance) with four supporting capabilities (value operation, system tooling, organization management, and security risk control), it realizes the integration of data development and operations in an agile, lean, automated, and intelligent way, with the value of operations made explicit.

At present, domestic enterprises including Industrial and Commercial Bank of China, Agricultural Bank of China, Zhejiang Mobile, and China Unicom have successfully practiced DataOps and achieved leaps in data productivity.

Data Stack's DataOps-based data operation practice

Data Stack is a one-stop big data infrastructure software suite built by Kangaroo Cloud. It comprises a series of products spanning the big data platform foundation, big data development and governance, and intelligent data analysis and insight. Integrating the DataOps concept of data operations, with independent controllability and secure technological innovation at its core, it gathers, processes, manages, serves, and analyzes enterprise-wide data assets, providing enterprises with a secure, stable, and easy-to-use big data platform that helps them spot digital opportunities, clarify the direction of transformation, and create new data value.

Data Stack's DataOps practice roadmap is as follows:

(Figure: Data Stack's DataOps practice roadmap)

At the solution level, Data Stack has accumulated rich, proven experience through practice in banking, funds, securities, insurance, higher education, government, ports, manufacturing, and other industries, and offers tailor-made designs for organizational transformation, technology selection, and implementation path planning.

For the data governance process, Data Stack has productized the methodology it has accumulated over many years. Below, we share some concrete operations at the product layer.

Data integration

Data integration is the process of extracting data from sources such as business systems, APIs, and files into the Data Stack big data platform, either offline or in real time. Whether extraction jobs are flexible and convenient to configure, whether the tool can adapt to an enterprise's diverse data sources, whether data transmission is stable and free of errors or omissions, and how well extraction performs are users' core concerns. ChunJun, Data Stack's self-developed distributed batch-stream unified synchronization tool, provides an excellent answer.

With data integration built on ChunJun, users can visually configure offline and real-time synchronization tasks in as little as 30 seconds and achieve bidirectional synchronization of multi-source heterogeneous data. Synchronization performance can be tuned flexibly by increasing concurrency and capping the synchronization rate. ChunJun also supports breakpoint resumption after a system exception interrupts synchronization, batch generation of synchronization tasks for an entire database, and exception analysis of records that fail to be read or written, via dirty-data tables.
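
To make this concrete, below is a minimal sketch of what an offline synchronization job can look like when expressed as a ChunJun-style JSON configuration (built here as a Python dict). The key names follow the spirit of ChunJun's open-source job format, but exact keys and connector names vary by version, so treat everything here, including the hosts, tables, and thresholds, as illustrative assumptions.

```python
import json

# A ChunJun-style batch sync job: reader (source) + writer (target) + settings.
job = {
    "job": {
        "content": [{
            "reader": {  # source side: a MySQL business table
                "name": "mysqlreader",
                "parameter": {
                    "username": "reader",
                    "password": "******",
                    "column": ["id", "order_no", "amount", "created_at"],
                    "connection": [{
                        "jdbcUrl": ["jdbc:mysql://mysql-host:3306/trade"],
                        "table": ["t_order"],
                    }],
                },
            },
            "writer": {  # target side: the big data platform (e.g. HDFS/Hive)
                "name": "hdfswriter",
                "parameter": {"path": "/warehouse/ods/ods_trade_order_di"},
            },
        }],
        "setting": {
            "speed": {"channel": 4, "bytes": 10485760},  # concurrency + rate cap
            "errorLimit": {"record": 100},               # dirty-data tolerance
            "restore": {"isRestore": True},              # resume after interruption
        },
    },
}

print(json.dumps(job, indent=2))
```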

After extraction completes, the metadata also lands in Data Stack's metadata store, and users can query table metadata in the data map of the Data Assets module.

Data standard definition, table-creation specification design, and standardized table creation

The Data Assets module can define data standards for table fields, specifying root words, code tables, and the business and technical attributes of fields, so as to avoid problems such as the same field being defined differently, or named inconsistently, across tables. The Data Stack platform has built-in standard templates for several industries and also supports one-click import of data standards, helping users quickly establish and manage them.
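
As a mental model, a field-level data standard can be thought of as a record like the hypothetical sketch below: root words compose the field name, a code table constrains the allowed values, and business/technical attributes pin down meaning and storage type. The class and field names here are ours for illustration, not the platform's schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataStandard:
    root_words: list[str]        # e.g. ["cust", "gender"] -> field name "cust_gender"
    cn_name: str                 # business attribute: display name
    data_type: str               # technical attribute: storage type
    code_table: dict[str, str] = field(default_factory=dict)  # allowed codes

    @property
    def field_name(self) -> str:
        return "_".join(self.root_words)

gender = DataStandard(
    root_words=["cust", "gender"],
    cn_name="customer gender",
    data_type="STRING",
    code_table={"M": "male", "F": "female", "U": "unknown"},
)
print(gender.field_name)  # cust_gender
```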

The table-creation specification mainly defines the data warehouse layers, the model elements that compose a table name at each layer, and the allowed content of those elements, which together constrain table naming to a unified standard in subsequent data model construction.
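
For instance, a naming rule might compose the layer prefix with subject, entity, and refresh-cycle elements, as in the sketch below. The element order is an illustrative assumption; the actual layers and elements are whatever the specification designer defines.

```python
# Compose a table name from warehouse layer + model elements.
# The element order (layer_subject_entity_refresh) is assumed for illustration.
WAREHOUSE_LAYERS = {"ods", "dwd", "dws", "ads"}

def build_table_name(layer: str, subject: str, entity: str, refresh: str) -> str:
    """e.g. build_table_name("dwd", "trade", "order", "di") -> "dwd_trade_order_di"."""
    if layer not in WAREHOUSE_LAYERS:
        raise ValueError(f"unknown warehouse layer: {layer}")
    return f"{layer}_{subject}_{entity}_{refresh}"

print(build_table_name("dwd", "trade", "order", "di"))  # dwd_trade_order_di
```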

Based on the table-creation specification, tables are created in a standardized way from the Data Assets module. When configuring basic information, the platform automatically associates the table with its warehouse layer and has the user fill in the technical attributes, thereby forming a standardized table name.

Based on the data standards, users only need to fill in the table's field list when defining the table structure; after parsing, the platform automatically maps each field to the standard with the same name and performs standard-coverage detection at creation time, simplifying table creation while keeping it standardized.
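
The idea of same-name mapping plus coverage detection can be illustrated with a toy check like the one below: given a new table's field list and a registry of standards keyed by field name, attach each matching standard and flag the uncovered fields. This is our own simplification, not the platform's algorithm.

```python
def check_standard_coverage(fields: list[str], standards: dict[str, dict]):
    """Map each field to a same-name standard; report fields with no standard."""
    covered = {f: standards[f] for f in fields if f in standards}
    uncovered = [f for f in fields if f not in standards]
    return covered, uncovered

registry = {"cust_gender": {"data_type": "STRING", "cn_name": "customer gender"}}
covered, uncovered = check_standard_coverage(["cust_gender", "order_no"], registry)
print(list(covered), uncovered)  # ['cust_gender'] ['order_no']
```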

Logical model and indicator design

The data model displays the basic relationships between fact tables and dimension tables, making it convenient to develop indicators later on top of these solidified data relationships.

DataIndex, the Data Stack indicator management platform, can organize the indicator system by business and summarize indicators into a catalog for each business domain.

For each indicator, information such as its name, code, business caliber, processing logic, and scheduling attributes can be defined.
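
Put together, one indicator definition carries roughly the fields in the hypothetical record below; the attribute names are ours for illustration, not DataIndex's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Indicator:
    name: str              # display name
    code: str              # unique indicator code
    business_caliber: str  # the business definition, in plain language
    processing_logic: str  # e.g. the aggregation SQL that produces the value
    schedule: str          # scheduling attribute, e.g. a cron expression

daily_gmv = Indicator(
    name="Daily GMV",
    code="IDX_TRADE_GMV_1D",
    business_caliber="Sum of paid order amounts per natural day",
    processing_logic="SELECT dt, SUM(amount) FROM dwd_trade_order_di GROUP BY dt",
    schedule="30 1 * * *",  # run daily at 01:30
)
print(daily_gmv.code)
```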

Data development, data quality verification, and unified code management

Data Stack supports two data development modes: offline and real-time. The following uses offline development as an example to walk through the data development process.

First, an administrator configures the SQL development specification. The platform has some built-in SQL inspection rules, and additional rules can be developed and registered on the platform according to its development instructions. Once these rules take effect, the platform scans the code before SQL is run or submitted. Among the issues found in the scan, if a prompt rule is triggered, i.e. a minor irregularity, a warning is shown without affecting running or submission; if a blocking rule is triggered, the code can be neither run nor submitted. In this way, high-risk SQL operations and unnecessary resource-hungry tasks can be stopped in advance.
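
The prompt-versus-blocking mechanism can be pictured as a pre-run scan like the sketch below: every rule is a pattern plus a severity, blocking hits veto the run, prompt hits only warn. The two sample rules are common conventions we assume for illustration, not the platform's built-in rule set.

```python
import re

# (message, severity, pattern); BLOCK vetoes run/submit, PROMPT only warns.
RULES = [
    ("SELECT * is discouraged", "PROMPT",
     re.compile(r"\bselect\s+\*", re.I)),
    ("DELETE/UPDATE without WHERE is forbidden", "BLOCK",
     re.compile(r"\b(delete|update)\b(?!.*\bwhere\b)", re.I | re.S)),
]

def scan(sql: str) -> bool:
    """Return True if the SQL may run/submit, False if a blocking rule fired."""
    ok = True
    for message, severity, pattern in RULES:
        if pattern.search(sql):
            print(f"[{severity}] {message}")
            if severity == "BLOCK":
                ok = False
    return ok

print(scan("SELECT * FROM dwd_trade_order_di"))  # prompt only -> True
print(scan("DELETE FROM dwd_trade_order_di"))    # blocked     -> False
```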

In offline development, users can orchestrate a data development business process as a workflow, writing the code of each task and configuring its scheduling attributes and task dependencies.

For tasks created on the offline development platform, the code can be connected to a remote repository (Bitbucket, GitLab) for pull and push, enabling unified management of code inside the enterprise. This is also often used to initialize the batch migration of tasks when the big data platform is replaced.
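
Conceptually, this pull/push integration behaves like running ordinary Git commands against the remote repository; the sketch below shows the equivalent plumbing with plain `git`, with the repository URL assumed for illustration.

```python
import subprocess

REMOTE = "git@gitlab.example.com:data-team/offline-tasks.git"  # assumed repo

def run(*args: str) -> None:
    subprocess.run(args, check=True)  # raise if the git command fails

# Pull the latest task code (e.g. when migrating tasks onto the platform) ...
run("git", "clone", REMOTE, "offline-tasks")
# ... and push local task changes back for unified code management.
run("git", "-C", "offline-tasks", "add", "-A")
run("git", "-C", "offline-tasks", "commit", "-m", "sync task code from platform")
run("git", "-C", "offline-tasks", "push", "origin", "main")
```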

After the SQL code has been tested and submitted, operations staff usually package and release the task to another project. During the release, the platform pre-checks whether the release package is complete, and a release approval flow can be started in the Data Stack approval center to control the standardization and impact of the release.

For financial scenarios where the test and production environments are network-isolated, the release process can also connect to the enterprise's unified approval center. After approval, tools such as Jenkins can transfer the release package across the network boundary so the task can land in production.
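
As a sketch of that hand-off, the snippet below uses Jenkins' standard remote-build endpoint to kick off a hypothetical cross-environment release job after approval; the Jenkins URL, job name, credentials, and parameter are all assumptions for illustration.

```python
import requests

JENKINS = "https://jenkins.example.com"  # assumed in-house Jenkins

def trigger_release(package_id: str) -> None:
    """Queue the (hypothetical) cross-env release job for one release package."""
    resp = requests.post(
        f"{JENKINS}/job/cross-env-release/buildWithParameters",
        auth=("release-bot", "API_TOKEN"),   # Jenkins user + API token
        params={"PACKAGE_ID": package_id},   # which approved package to ship
        timeout=10,
    )
    resp.raise_for_status()  # Jenkins answers 201 when the build is queued

trigger_release("pkg-20230801-001")
```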

Meanwhile, two very important questions remain: how do we evaluate the quality of the data produced? And when a quality problem occurs, can the business process be interrupted in time and the developers notified promptly?

The data asset platform DataAssets supports single-table and multi-table quality verification. Single-table verification has built-in rules for completeness, accuracy, standardization, and uniqueness, and users can also perform personalized checks with custom SQL. Multi-table verification compares the data of two tables; for example, in a data synchronization scenario it can verify whether any records were lost or corrupted between the source and the target.

When a quality task is associated with an offline task, configuring strong/weak rules and alarms for the quality checks lets an important quality problem stop the task flow in time and notify the relevant developers.
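
A toy version of this gating logic is sketched below: each check result carries a strong or weak label, a failed strong rule raises and stops the downstream flow, a failed weak rule only raises an alert. The two checks, a null-rate threshold and a source-versus-target row-count comparison, stand in for the platform's built-in rules.

```python
def null_rate_ok(null_rows: int, total_rows: int, threshold: float = 0.01) -> bool:
    """Completeness check: share of NULLs must stay under the threshold."""
    return total_rows == 0 or null_rows / total_rows <= threshold

def row_counts_ok(source_rows: int, target_rows: int) -> bool:
    """Multi-table check: source and target row counts must match."""
    return source_rows == target_rows

def run_checks(results: dict[str, tuple[bool, str]]) -> None:
    for name, (passed, strength) in results.items():
        if passed:
            continue
        if strength == "strong":
            raise RuntimeError(f"strong rule failed: {name} -- stopping task flow")
        print(f"[ALERT] weak rule failed: {name} -- flow continues")

try:
    run_checks({
        "null_rate(order_no)": (null_rate_ok(5, 1000), "weak"),        # passes
        "row_count(src vs tgt)": (row_counts_ok(1000, 998), "strong"),  # fails
    })
except RuntimeError as err:
    print(err)  # strong failure: downstream tasks would be held back
```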

Data service

Data produced on the Data Stack platform can be served externally through APIs, self-service queries, and data synchronization to external databases. These channels are often used by upper-layer data applications such as reports, large-screen dashboards, tags, and data portals.
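
From the consumer's side, calling a published data API is an ordinary authenticated HTTP request, as in the sketch below; the endpoint path, parameter, and token are hypothetical placeholders, since real endpoints are whatever the API designer publishes.

```python
import requests

# Query one day of a (hypothetical) daily-GMV API published by the platform.
resp = requests.get(
    "https://dataapi.example.com/api/v1/trade/daily_gmv",
    params={"dt": "2023-08-01"},
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```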

Security management

● User authentication

Data Stack supports single sign-on with enterprise systems and authentication methods such as LDAP and OAuth2, and multi-level Kerberos authentication can be configured.

● Data permission management

At the platform layer, Data Stack implements data permission management on Hadoop and can automatically identify and classify data into different levels. On specific Hadoop versions it also supports Ranger-based permission policies, and it can integrate with an enterprise's existing data permission management system.

● Approval process integration

Permission applications for data resources such as tables and APIs, data standards, offline task releases, and other processes involving permission changes or internal go-lives can all be managed through the Data Stack approval center.

● Operation audit

All key operations, such as running tasks, table DDL operations, adding and deleting users, and permission applications, are recorded in the audit log.

Going forward, Data Stack will continue to improve the whole data governance chain, raising the quality and efficiency of data production through product experience optimization and intelligent tool upgrades, and continuously powering and safeguarding the value of enterprise data.

"Dutstack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download address: https://www.dtstack.com/resources/1001?src=szsm If you want to know or consult more about Kangaroo Cloud big data products, industry solutions, and customer cases, visit Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Meanwhile, anyone interested in big data open-source projects is welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" to exchange the latest open-source technology news. Group number: 30537511; project address: https://github.com/DTStack
