Activating the Value of Data: Exploring Data Architecture and Its Practice under DataOps | DTVision Development Governance

According to the China Academy of Information and Communications Technology, from 2012 to 2021 the scale of China's digital economy grew from 12 trillion yuan to 45.5 trillion yuan, and its share of GDP rose from 21.6% to 39.8%. In line with this trend of the times, it is now an undisputed consensus that data has become a new factor of production.

If the rise of the data middle platform represents enterprises' digital transformation from process-driven to data-driven, and from digitization to intelligence, then DataOps is an excellent idea, or methodology, for realizing the data middle platform.

The concept of DataOps was proposed by Lenny Liebmann as early as 2014. In 2018, DataOps was officially included in Gartner's data management technology maturity curve, marking its formal acceptance and promotion by the industry. Although it is still at an early stage of development in China, the popularity of DataOps is growing year by year, and it will see wider adoption over the next 2-5 years.

Kangaroo Cloud is one such explorer. As a full-link digital technology and service provider, Kangaroo Cloud has been deeply engaged in the big data field since its founding. As digital and intelligent transformation accelerates across industries, many problems in data governance and data management have gradually surfaced.

To this end, driven by technological progress and customers' digital transformation needs, Kangaroo Cloud has built a one-stop big data development and governance platform, Data Stack DTinsight, based on the DataOps concept. It carries the data value process end to end, provides full-lifecycle data quality supervision, and standardizes the data development process to safeguard data governance.

Responding to Changes

One of the core concepts of DataOps is to respond to changes on the demand side in a timely manner. The following is a typical enterprise data architecture diagram:

Data flows in from the source systems on the left; the intermediate links are various data processing tools, such as data lakes, data warehouses or data marts, and AI analysis. The data goes through processes such as cleaning, processing, aggregation and statistics, and data governance, after which BI, custom reports, APIs, and other tools serve the various consumers.

(Figure: typical enterprise data architecture)

When defining the platform architecture, data architects or managers in an enterprise generally focus on issues such as peak performance, latency, and load management in the production environment. Many computing engines and databases excel at this. But such an architecture does not reflect the ability to respond to rapid change:

It's like designing a highway while considering only traffic capacity under normal conditions, ignoring temporary disruptions such as accidents, congestion, and heavy rain, and reserving no spare capacity (for example, in a tunnel with only 2 lanes, a minor scratch accident can block the whole tunnel). Enterprise data platforms face a similar situation: data workers must respond to such changes every day or even every hour, and sometimes even a simple SQL change can take several days to go live.

If these changes are considered at the design stage, the platform can respond more flexibly and stably. The following describes the data architecture from a DataOps perspective.

Data Architecture from a DataOps Perspective

Data architects can put forward agility standards like the following at the start of construction:

• A task should go from development complete to production release within 1 hour, without impacting the production environment

• Data errors are caught before release to production

• Major changes are completed within 1 day

There are also some environmental issues to consider, including:

• Separate development, test, and production environments must be maintained, while ensuring a degree of consistency between them, at minimum consistency of metadata

• Data testing, quality monitoring, and deployment to production can be orchestrated manually or fully automated
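The metadata-consistency requirement above can be checked programmatically. Below is a minimal sketch (not DataStack's actual API); it assumes each environment's metadata is represented as a simple mapping from table name to column list:

```python
def metadata_diff(env_a: dict, env_b: dict) -> dict:
    """Compare table metadata between two environments.

    Each argument maps table name -> list of column names.
    Returns the tables that are missing or whose columns differ.
    """
    issues = {}
    for table, cols in env_a.items():
        if table not in env_b:
            issues[table] = "missing in target"
        elif env_b[table] != cols:
            issues[table] = f"columns differ: {cols} vs {env_b[table]}"
    for table in env_b:
        if table not in env_a:
            issues[table] = "missing in source"
    return issues

# Example: dev has an extra column and an extra table compared to prod.
dev  = {"orders": ["id", "amount", "ts"], "users": ["id", "name"]}
prod = {"orders": ["id", "amount"]}
print(metadata_diff(dev, prod))
```

A check like this can run as a gate before each release, flagging drift between environments early.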

When architects start thinking about data quality, rapid release, and real-time data monitoring, the enterprise has taken a step toward DataOps.

Decomposition and Practice of DataOps Architecture

Having discussed the data architecture from a DataOps perspective, let's break the DataOps architecture down into concrete practice. It can be decomposed into the following key points:

(Figure: key points of DataOps practice)

Managing Multiple Environments

The first step of DataOps is environment management, which generally means independent development, testing, and production environments, each supporting task orchestration, monitoring, and automated testing.

At present, DataStack can support multiple environments at the same time: as long as the network is connected, a single DataStack deployment can connect to and manage several different environments in a unified way. DataStack distinguishes environments through the "cluster" concept in Console. Different clusters can flexibly connect to various computing engines, such as open-source or distribution versions of Hadoop, Transwarp Inceptor, Greenplum, OceanBase, and even traditional relational databases such as MySQL and Oracle, as shown in the following figure:
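Conceptually, the cluster registry boils down to a mapping from environment to engine and connection details. The sketch below is purely illustrative (the cluster names, engines, and addresses are invented, not DataStack's real configuration):

```python
# Hypothetical cluster registry: each environment is bound to a
# different compute engine behind one management platform.
CLUSTERS = {
    "dev":  {"engine": "hadoop",    "address": "dev-hdfs:8020"},
    "test": {"engine": "greenplum", "address": "test-gp:5432"},
    "prod": {"engine": "oceanbase", "address": "prod-ob:2881"},
}

def get_cluster(env: str) -> dict:
    """Look up the engine binding for an environment; fail fast if unknown."""
    if env not in CLUSTERS:
        raise KeyError(f"unknown environment: {env}")
    return CLUSTERS[env]
```

Keeping this mapping in one place is what lets a single platform address many heterogeneous engines uniformly.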

(Figure: cluster configuration in Console)

Task Release

Following on from multi-environment management, tasks need to be released across environments during actual development. Assuming development, testing, and production environments, releases cascade from one environment to the next, as shown below:

(Figure: cascaded release across development, test, and production)

For this multi-environment scenario, DataStack supports release management in three ways:

● One-click release

When the environments are network-connected, a single platform can connect to all of them and enable "one-click release". During a one-click release, only users with the appropriate permissions can perform the release action, which protects the stability of the production environment. At the same time, key environment-specific information can be replaced automatically, such as data source connection parameters in sync tasks and compute configurations for each environment. One-click release is best suited to SaaS or internal cloud platform deployments.
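The automatic replacement of environment-specific information can be pictured as simple placeholder substitution. The sketch below is a minimal illustration under assumed conventions (the `${name}` placeholder syntax and the parameter names are hypothetical, not DataStack's actual mechanism):

```python
import re

# Hypothetical per-environment parameters, substituted at release time.
ENV_PARAMS = {
    "test": {"jdbc_url": "jdbc:mysql://test-db:3306/dw", "queue": "test"},
    "prod": {"jdbc_url": "jdbc:mysql://prod-db:3306/dw", "queue": "prod"},
}

def render_task(template: str, env: str) -> str:
    """Replace ${name} placeholders with the target environment's values."""
    params = ENV_PARAMS[env]
    return re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], template)

# The same task definition is released to test and prod unchanged;
# only the substituted parameters differ.
task = "url=${jdbc_url};yarn.queue=${queue}"
print(render_task(task, "prod"))
```

Because substitution happens at release time, the task code itself stays identical across environments, which is exactly what keeps cascaded releases safe.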

● Import/Export publishing

In the vast majority of scenarios we currently encounter in China, production environments adopt strict physical isolation for a higher level of security. In this scenario, tasks can be released across environments via import/export: users manually import new, changed, or deleted tasks into the downstream environment.
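Import/export publishing amounts to serializing a change set in one environment and applying it in another. A minimal sketch, assuming tasks are plain dictionaries keyed by name and deletions are marked with a `deleted` flag (both are illustrative conventions, not DataStack's real package format):

```python
import json

def export_tasks(tasks: list) -> str:
    """Serialize a change set to a JSON package for transfer across the air gap."""
    return json.dumps({"version": 1, "tasks": tasks}, ensure_ascii=False)

def import_tasks(package: str, existing: dict) -> dict:
    """Apply a package: add new tasks, overwrite changed ones, honor deletions."""
    merged = dict(existing)
    for t in json.loads(package)["tasks"]:
        if t.get("deleted"):
            merged.pop(t["name"], None)
        else:
            merged[t["name"]] = t
    return merged
```

Carrying a version field in the package lets the importing side reject formats it does not understand, which matters when the two environments are upgraded on different schedules.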

● DevOps release

Some customers may have purchased or built a company-wide online release tool. In this case, DataStack is customized to connect to that tool's interface, and the relevant change information is executed and released by a third-party CI tool (such as Jenkins).

(Figure: release via a third-party CI tool)

Code version management

Each cross-environment release should record the version of the released code, to aid later troubleshooting. In practice, operations such as comparing code between versions and rolling back to a previous version are often required.

In addition to comparing code content, DataStack also supports comparing other task-related information, including the task scheduling cycle configuration, task execution parameters, environment parameters, and so on, and supports "one-click rollback" to a specified version.
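The code-comparison part of version management is essentially a text diff between two stored versions. A minimal sketch using Python's standard `difflib` (purely illustrative; DataStack's own comparison view is not necessarily implemented this way):

```python
import difflib

def compare_versions(old: str, new: str) -> str:
    """Produce a unified diff between two versions of task code."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="v1", tofile="v2",
    ))

v1 = "SELECT id FROM orders;\n"
v2 = "SELECT id, amount FROM orders;\n"
print(compare_versions(v1, v2))
```

Rollback then reduces to re-releasing the stored older version, which is why recording every released version is the prerequisite for both operations.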

(Figure: task version comparison and one-click rollback)

Access and Rights Management

Among an enterprise's environments, the production environment generally has the strictest requirements, while development and testing are relatively relaxed. Users' authentication and access information therefore needs to be managed per environment. In practice, to facilitate development and testing, and because those environments hold no sensitive data, ordinary users generally have full data permissions and access to all tools there; in the production environment, however, each user must hold only their own data permissions.

Depending on the engine, DataStack supports a variety of data permission management methods, including:

● Hadoop engine

Kerberos-based authentication security plus Ranger/LDAP-based data security. This supports data permission control at the database, table, and field level, as well as data masking.

(Figure: Hadoop permission management)

● JDBC class engine

In some scenarios, customers build the data platform not on Hadoop but on JDBC-type databases (such as TiDB, Doris, or Greenplum). DataStack does not manage the JDBC database's permissions itself; instead it uses account binding to distinguish the permissions of different accounts, for example:

· DataStack account A is bound to the database root account

· DataStack account B is bound to the database admin account
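The account-binding idea above can be sketched as a lookup table: the platform account resolves to a database account, and effective permissions are whatever the database grants that account. The names below are hypothetical examples, not real DataStack identifiers:

```python
# Hypothetical account bindings: platform user -> database credential.
# The database, not the platform, enforces what each DB account may do.
BINDINGS = {
    "datastack_a": {"db_user": "root",  "db_password": "***"},
    "datastack_b": {"db_user": "admin", "db_password": "***"},
}

def resolve_db_account(platform_user: str) -> str:
    """Return the bound database user, or fail if no binding exists."""
    binding = BINDINGS.get(platform_user)
    if binding is None:
        raise PermissionError(f"{platform_user} has no bound database account")
    return binding["db_user"]
```

The design trade-off is that permission granularity is delegated entirely to the database: to narrow a platform user's rights, you rebind them to a more restricted database account.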

● Task scheduling, testing and monitoring

DataStack can link all of the above stages. From the development stage, users can publish to the test environment with one click; after verification in the test environment, they observe how task instances run and whether data is produced correctly, and once everything checks out, the task can be released to the production environment.
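The develop-test-verify-release flow above can be sketched as a small gating function. This is a conceptual illustration only (the callback shape and return strings are invented, not DataStack's pipeline API):

```python
def release_pipeline(task: dict, run_in_test, verify_output) -> str:
    """Run a task in the test environment; release to production
    only if output verification passes."""
    instance = run_in_test(task)          # execute one task instance in test
    if not verify_output(instance):       # check the task's data output
        return "blocked: test verification failed"
    return "released to production"
```

The point of the gate is that production release is a consequence of a passing test run, never a manual step taken on faith.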

Final Words

DataOps is a best-practice concept, but it is still at a relatively early stage in China. DataStack has accumulated some practical experience in this area, yet much remains to be optimized: for example, data quality rules also need cross-environment release, and task code and task template export need to support more task types. We expect more DataOps best practices to emerge across the industry.

Kangaroo Cloud open source framework DingTalk technical exchange group (30537511): anyone interested in big data open source projects is welcome to join and exchange the latest technical information. Open source project repository: https://github.com/DTStack/Taier


Origin my.oschina.net/u/3869098/blog/5581698