Big data development and management platform DataWorks

Preamble

Learn the design concepts of Alibaba's DataWorks and what it can do.

Reference documents:

  1. https://www.aliyun.com/product/bigdata/ide

  2. https://help.aliyun.com/document_detail/73015.html

  3. https://help.aliyun.com/document_detail/324149.html ----Data Governance

Introduction

Built on big data engines such as Alibaba Cloud ODPS/EMR/CDP, DataWorks provides a unified, full-link big data development and management platform for data warehouse, data lake, and lake-house integration solutions. As the builder of Alibaba's data middle platform, DataWorks has continuously accumulated Alibaba's big data construction methodology since 2009, and has worked with tens of thousands of customers in government, finance, retail, Internet, energy, and manufacturing to help drive digital upgrades across industries.

Smart Data Modeling

Data Warehouse Planning

When using DataWorks for data modeling, you can design the data layering, business categories, data domains, and business processes on the data warehouse planning page.

In layman's terms, data warehouse planning here means labeling the data: distinguishing data by stage, scenario, and business according to classification methods such as data processing flow, business classification, business process, and data domain, so that data can later be queried by type.

Data Layering

You can design the data warehouse's layering based on your business and data scenarios. By default, DataWorks creates the five-layer data warehouse structure commonly used in the industry: ----The classification / processing flow below is what we would usually expect

  • Data introduction layer ODS (Operational Data Store)

  • Detailed data layer DWD (Data Warehouse Detail)

  • Summary data layer DWS (Data Warehouse Summary)

  • Application data layer ADS (Application Data Service)

  • Common dimension layer DIM (Dimension)

You can also create other data layers according to business needs. For how to create data layers, see Creating Data Warehouse Layers.
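To make the layering concrete, here is a small illustrative sketch in Python (the table names and the prefix rule are invented examples, not DataWorks behavior) of the common convention of prefixing table names with their layer:

```python
# Hypothetical illustration of the five-layer convention: each layer gets a
# name prefix, so a table's position in the processing flow is visible from
# its name. The table names and checking logic are invented examples.
LAYERS = {
    "ods": "operational data store: raw data as introduced from sources",
    "dwd": "data warehouse detail: cleaned, fine-grained fact data",
    "dws": "data warehouse summary: aggregated data for common analyses",
    "ads": "application data service: data shaped for specific applications",
    "dim": "dimension: shared dimension tables (users, products, ...)",
}

def layer_of(table_name: str) -> str:
    """Infer a table's warehouse layer from its name prefix."""
    prefix = table_name.split("_", 1)[0]
    if prefix not in LAYERS:
        raise ValueError(f"{table_name!r} does not follow the layer naming rule")
    return prefix

for table in ["ods_orders", "dwd_order_detail", "dws_daily_sales", "dim_product"]:
    print(table, "->", layer_of(table), "-", LAYERS[layer_of(table)])
```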

Business Categories

When an enterprise's business is complex and different lines of business need to share data domains, but you still want to quickly locate a given business's data during model design and application, you can plan business categories based on the real business situation, and then associate each modeled dimension table and detail table with its corresponding business category. For how to create a business category, see Business Category.

Data Domain

A data domain is a higher-level data classification standard: a collection formed by abstracting, refining, and combining enterprise business processes. It is the first entry point for business personnel when using data, letting them quickly delineate their own business data.

Data domains are oriented toward business analysis; each data domain corresponds to a macro analysis area, such as the procurement domain, supply chain domain, HR domain, or e-commerce domain. It is recommended that data domains be managed and set by a unified organization or person (such as a data architect or a member of the model team), since the designer needs a deep business understanding of the enterprise and the ability to interpret and abstract the business expressively. For how to plan and construct data domains with DataWorks, see Data Domains.

Business Process

A business process describes the course of a business activity. For example, in e-commerce, adding to cart, placing an order, and paying can each be a business process. Business processes have very typical applications in business effect analysis, such as the common funnel analysis: the activity of purchasing goods is decomposed into business processes such as browsing products, adding to cart, placing an order, paying, and confirming receipt, and counting the "number of orders" at each step enables a funnel analysis of that indicator. For how to create a business process with DataWorks, see Business Process.

Data Standards

DataWorks data modeling supports planning and formulating data standards before modeling, or distilling the enterprise's data standards from business conditions during modeling. Normative constraints on standard codes, units of measurement, data standards, and naming dictionaries guarantee that data is processed consistently in subsequent modeling and application.

For example, suppose there are two tables, a registration table and a login table. The registration table stores the member ID in a field named user_id; the login table stores the same member ID in a field named userid. You can create a unified data standard for the member ID, such as the standard code used in data processing, the attribute requirements for the field (data type, length, default value, etc.), and its unit of measurement. Once the standard is created, any member ID field defined during subsequent modeling can be associated directly with this standard, ensuring that all member ID fields follow a uniform standard.

This effectively specifies the standard information for every field in the business: although incoming data fields have different names, they all correspond to the same standard field content.
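As a minimal sketch of the idea (the class, field names, and mapping below are hypothetical, not DataWorks' actual API), a data standard can be viewed as one canonical field definition plus a mapping from source-specific field names onto it:

```python
# A minimal sketch of the "data standard" idea: one canonical definition
# for the member ID, plus a mapping from source-specific field names.
# All names here are hypothetical, not DataWorks' actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldStandard:
    name: str     # canonical field name
    dtype: str    # required data type
    length: int   # maximum length
    default: str  # default value

MEMBER_ID = FieldStandard(name="member_id", dtype="VARCHAR", length=32, default="")

# Source tables use different names for the same business concept.
SOURCE_ALIASES = {
    "registration.user_id": MEMBER_ID,
    "login.userid": MEMBER_ID,
}

def standardize(table: str, field: str) -> FieldStandard:
    """Resolve a source field to its associated data standard."""
    return SOURCE_ALIASES[f"{table}.{field}"]

print(standardize("login", "userid").name)  # -> member_id
```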

Dimensional Modeling

The data modeling concept of DataWorks follows the dimensional modeling approach. When using DataWorks' dimensional modeling function for data warehouse design, the following model types are available: ----This mainly classifies the models; different model types are created in different ways

  • Dimension table

Combined with the planning of business data domains, extract the dimensions that may be needed when analyzing data in each domain, and store each dimension and its attributes as a dimension table. For example, when analyzing e-commerce business data, available dimensions and attributes include: the order dimension (order ID, order creation time, buyer ID, seller ID, etc.), the user dimension (gender, date of birth, etc.), and the product dimension (product ID, product name, product launch time, etc.). You can create these as order, user, and product dimension tables, with the dimension attributes recorded as table fields. Later, deploy these dimension tables to the data warehouse and load the actual dimension data via ETL according to the table definitions, so that business personnel can access them in subsequent data analysis.

  • Detail table

Combined with the planning of business processes, sort out and analyze the actual data that each business process may generate, and store those actual data fields as a detail table. For example, for the order-placing business process, you can create an order detail table to record the actual data fields generated when an order is placed, such as order ID, order creation time, product ID, quantity, and amount. Later, deploy these detail tables to the data warehouse and load the real data via ETL according to the table definitions, so the data can be accessed easily during business analysis.

  • Summary table

You can combine business data analysis with the data warehouse layering to pre-aggregate some detailed fact data and dimension data into a summary table, and then use the summary table directly in subsequent analysis instead of querying the detail tables and dimension tables. ----Commonly used; similar to data layering

  • Reverse modeling

Reverse modeling is mainly used to reverse-engineer models produced by other modeling tools into DataWorks dimensional models. For example, if you have already generated models with other tools and want to switch to DataWorks' intelligent modeling for subsequent work, the reverse modeling function can quickly convert the existing models into DataWorks dimensional models without redoing the modeling, saving considerable time and cost. ----The grand theory of model reuse

For details on how to create dimension tables, detail tables, and summary tables, see Creating Dimension Tables, Creating Detail Tables, and Creating Summary Tables. For reverse modeling operations, see Reverse Modeling.
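To make the three table types concrete, here is a small hypothetical sketch in plain pandas (the tables and columns are invented; this illustrates the idea, not DataWorks itself): a detail table is joined to a dimension table and pre-aggregated into a summary table:

```python
# Hypothetical sketch of dimension / detail / summary tables using pandas;
# this illustrates the dimensional-modeling idea, not DataWorks' own API.
import pandas as pd

# Product dimension table: one row per product, with descriptive attributes.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["diapers", "keyboard"],
    "category": ["maternal_infant", "electronics"],
})

# Order detail table: one row per order line, with measures.
dwd_order = pd.DataFrame({
    "order_id": [101, 102, 103],
    "order_date": ["2023-03-01", "2023-03-01", "2023-03-02"],
    "product_id": [1, 2, 1],
    "amount": [120.0, 300.0, 80.0],
})

# Summary table: pre-aggregate detail joined to the dimension, so later
# analysis can read this table instead of re-joining detail + dimension.
dws_daily_category = (
    dwd_order.merge(dim_product, on="product_id")
             .groupby(["order_date", "category"], as_index=False)["amount"]
             .sum()
)
print(dws_daily_category)
```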

Data Indicators

DataWorks data modeling provides a data indicator function, giving you the ability to establish a unified indicator system.

The indicator system consists of atomic indicators, modifiers, time periods, and derived indicators. ----This should also incorporate the data standard; that is, it specifies which fields under a business must meet the data standard, plus the related calculated fields

  • Atomic indicator: a measure based on a business process, such as the "payment amount" in the "payment order" business process.

  • Modifier: a restriction on the business scope of an indicator's statistics, such as limiting the statistics of "payment amount" to "maternal and infant products".

  • Time period: the time range or point for indicator statistics, for example, restricting "payment amount" statistics to "the last 7 days".

  • Derived indicator: defined by combining an atomic indicator, modifiers, and a time period, for example, the "payment amount" of "maternal and infant products" in the "last 7 days".
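As a worked illustration of how the four parts compose (the data and names below are invented), the derived indicator "payment amount of maternal and infant products in the last 7 days" applies the modifier and the time period as filters and then sums the atomic indicator:

```python
# Derived indicator = atomic indicator + modifier + time period.
# Hypothetical data; illustrates the composition, not DataWorks' API.
from datetime import date, timedelta

payments = [  # (pay_date, category, amount) from the "payment order" process
    (date(2023, 3, 1), "maternal_infant", 120.0),
    (date(2023, 3, 3), "electronics", 300.0),
    (date(2023, 3, 5), "maternal_infant", 80.0),
]

today = date(2023, 3, 6)
window_start = today - timedelta(days=7)       # time period: last 7 days

derived = sum(
    amount
    for pay_date, category, amount in payments
    if category == "maternal_infant"           # modifier
    and window_start <= pay_date <= today      # time period
)                                              # atomic indicator: payment amount
print(derived)  # -> 200.0
```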

The Need for Data Modeling

  • Standardized management of massive data

The larger an enterprise's business, the more complex its data architecture becomes, and the volume of enterprise data grows rapidly with the business. Managing and storing data in a structured, orderly way is a challenge every enterprise faces.

  • Business data interconnection, breaking down information barriers

Independent data held by individual businesses and departments within a company forms data silos, which prevent decision-makers from getting a clear, fast picture of the company's data. Breaking down the information silos between departments and business areas is a major problem in enterprise data management.

  • Data standard integration, unified and flexible docking

Different descriptions of the same data make enterprise data management difficult, duplicate content, and produce inaccurate results. Formulating a unified data standard without breaking the original system architecture, while enabling flexible connection between upstream and downstream businesses, is one of the core concerns of standardized management.

  • Data value maximization, enterprise profit maximization

Making the fullest use of all kinds of enterprise data maximizes the value of that data and provides the enterprise with more efficient data services.

Data Integration

Data Integration is a stable, efficient, and elastically scalable data synchronization platform, dedicated to providing fast, stable data movement and synchronization between rich heterogeneous data sources in complex network environments.

The preceding sections covered data modeling; this section covers how the raw data is brought in.

Usage Restrictions

Data synchronization is certainly not compatible with every data source, at least not all at once.

  • Data synchronization:

It supports, and only supports, synchronizing structured data (such as RDS and DRDS), semi-structured data, and unstructured data (such as OSS and TXT files, where the specific data to be synchronized must be abstractable into structured data). That is, data integration only supports synchronizing data that can be abstracted into a logical two-dimensional table; it does not support synchronizing completely unstructured data stored in OSS (such as an MP3 file) to MaxCompute.

  • Network connectivity:

Data synchronization and exchange are supported within a single region and between some regions. Some regions can transmit over the classic network, but connectivity is not guaranteed; if the classic network test fails, it is recommended to connect over the public Internet.

  • Data transmission:

Data integration only performs data synchronization (transmission); it does not provide a way to consume data streams.

  • Data consistency:

Data integration synchronization only supports at-least-once, not exactly-once, delivery. In other words, data duplication cannot be ruled out; deduplication must be guaranteed by a primary key plus the destination's capabilities (see the sketch below). ----This is the shortcoming
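A minimal sketch of what "primary key + destination capability" means in practice (hypothetical code, not the DataWorks writer): under at-least-once delivery, an idempotent upsert keyed on the primary key makes replayed records harmless:

```python
# At-least-once delivery can replay records; an idempotent upsert keyed on
# the primary key makes duplicates harmless. Hypothetical sketch.
destination: dict[int, dict] = {}   # stands in for a table with a primary key

def upsert(record: dict) -> None:
    """Write a record; replaying the same record leaves the table unchanged."""
    destination[record["id"]] = record   # same key -> overwrite, no duplicate row

for record in [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]:
    upsert(record)                       # record id=1 is delivered twice

print(len(destination))  # -> 2, duplicates collapsed by the primary key
```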

Introduction to Offline (Batch) Sync

Data integration is mainly used for offline (batch) data synchronization. The offline (batch) data channel defines data sources and data sets for the source and destination, provides a set of abstract data extraction plug-ins (Reader) and data writing plug-ins (Writer), and designs a simplified intermediate data-transmission format on top of this framework, thereby enabling data transmission between arbitrary structured and semi-structured data sources.

The key point is the Reader and Writer plug-ins that perform the batch reads and writes.
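A rough sketch of the Reader/Writer plug-in pattern with a shared intermediate record format (the interfaces and class names below are invented for illustration; they are not the real plug-in API):

```python
# Sketch of the Reader/Writer plug-in pattern: any source that can produce
# the intermediate record format can be paired with any destination that
# can consume it. Interfaces are hypothetical, not the real plug-in API.
from abc import ABC, abstractmethod
from typing import Iterator

Record = list  # simplified intermediate format: one row as a list of fields

class Reader(ABC):
    @abstractmethod
    def read(self) -> Iterator[Record]: ...

class Writer(ABC):
    @abstractmethod
    def write(self, records: Iterator[Record]) -> None: ...

class CsvReader(Reader):
    def __init__(self, lines: list[str]):
        self.lines = lines
    def read(self) -> Iterator[Record]:
        for line in self.lines:
            yield line.split(",")

class StdoutWriter(Writer):
    def write(self, records: Iterator[Record]) -> None:
        for record in records:
            print("|".join(record))

# The channel just plugs a Reader into a Writer via the intermediate format.
StdoutWriter().write(CsvReader(["1,alice", "2,bob"]).read())
```

The value of the pattern is that adding one new plug-in connects a new system to every existing source or destination, instead of requiring a dedicated channel for each pair.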

Introduction to real-time synchronization

Real-time synchronization in data integration consists of three basic plug-in types: real-time reading, conversion, and writing. The plug-ins exchange data through an internally defined intermediate format.

A real-time synchronization task supports multiple conversion plug-ins for data cleansing, and multiple write plug-ins for multi-way output. For some scenarios, it also supports whole-database real-time synchronization solutions, so you can synchronize multiple tables in real time at once. For details, see Real-time Data Synchronization.

Multi-way output dovetails with the dimensional modeling content covered earlier.
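A compact sketch of the read, transform, multi-way write pipeline (all names hypothetical), reusing the Reader/Writer idea above with a chain of conversion steps:

```python
# Sketch of a real-time pipeline: one reader, a chain of transforms, and
# several writers fed from the same stream. Names are hypothetical.
from typing import Callable, Iterable

Record = dict
Transform = Callable[[Record], Record]

def pipeline(source: Iterable[Record],
             transforms: list[Transform],
             writers: list[Callable[[Record], None]]) -> None:
    for record in source:
        for transform in transforms:      # data cleansing steps, in order
            record = transform(record)
        for write in writers:             # multi-way output: every writer
            write(record)                 # receives every cleaned record

events = [{"user": " Alice ", "amount": "12"}]
pipeline(
    events,
    transforms=[
        lambda r: {**r, "user": r["user"].strip()},   # trim whitespace
        lambda r: {**r, "amount": int(r["amount"])},  # cast to int
    ],
    writers=[print, lambda r: None],  # e.g. console plus a no-op sink
)
```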

Synchronization Solution Advantages

  • Full data initialization.

  • Incremental data is written in real time.

  • Incremental data and full data are automatically merged and written to a new full table partition at regular intervals.

Synchronization Parameters

  • Concurrency

Concurrency is the maximum number of threads a data synchronization task can use to read from the source or write to the destination in parallel. ----Concurrent reading is where this matters most; how are repeated reads avoided? In practice the source is typically split into non-overlapping shards (for example, by primary-key range) so that each thread reads a disjoint slice.

  • Rate limit

The rate limit is the maximum transmission speed a data integration synchronization task is allowed to reach.

  • Dirty data

Dirty data is data that is meaningless to the business, has an invalid format, or runs into problems during synchronization. If an exception occurs while writing a single record to the destination, that record is dirty data; any record that fails to write is therefore classified as dirty. For example, VARCHAR data from the source written to an INT destination column cannot be written because the conversion is invalid, so it becomes dirty data. When configuring a synchronization task, you can control whether dirty data is allowed during synchronization and cap the number of dirty records: when dirty data exceeds the limit, the task fails and exits (see the sketch after this list).

  • Data source

A data source is the origin of the data DataWorks processes, which may be a database or a data warehouse. DataWorks supports many types of data sources and supports conversion between them.

Before configuring a data integration synchronization task, you can register the source and destination databases or data warehouses on the DataWorks data source management page; during synchronization, you select data sources by name to control which databases or data warehouses are read from and written to.
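The dirty-data sketch referenced above (hypothetical code, not the actual sync engine): a record whose type conversion fails on write is counted as dirty, and the task aborts once the dirty count exceeds the configured limit:

```python
# Sketch of dirty-data handling: records that fail to write (here, a failed
# VARCHAR -> INT conversion) count as dirty; the task fails once the count
# exceeds the configured limit. Hypothetical code, not the real engine.
source_rows = ["42", "oops", "7", "NaN?"]   # source column is VARCHAR
dirty_limit = 2                              # allow at most 2 dirty records

written, dirty = [], []
for value in source_rows:
    try:
        written.append(int(value))           # destination column is INT
    except ValueError:
        dirty.append(value)                  # conversion failed -> dirty data
        if len(dirty) > dirty_limit:
            raise SystemExit(f"too many dirty records: {dirty!r}")

print(written, dirty)  # -> [42, 7] ['oops', 'NaN?']
```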

Data Governance

The DataWorks Data Governance Center automatically discovers pending governance issues that arise while using the platform, across dimensions such as data storage, task computation, code development, data quality, and data security, and quantifies them with a health-score model. It presents governance results from global, workspace, and personal perspectives in the form of governance reports and leaderboards, helping you drive the resolution of governance issues and reach governance goals. For cost governance, the Data Governance Center provides detailed per-task resource consumption, overall resource consumption trends, per-task cost estimation, and other features that help you optimize and control spending on all kinds of resources.

Governance Content

The following is mainly divided into pre-development checks and post-development checks; what is checked differs depending on the stage.

  • Check items: used for pre-checks in stages such as data development. They detect content that violates data standards during development and generate problem events that block the normal development flow, thereby constraining and managing the development process.

For example, a check item can be configured to forbid select * statements or to disallow creating tables with create table statements (see the sketch after this list).

  • Check item events: the problem events, detected by check items, that block normal execution of the development flow.

  • Governance items: used in the analysis stage after data development to detect pending governance and optimization issues in the system. Governance items include required and optional items.

For example, governance items can be configured for overly long task run times, consecutively failing nodes, leaf nodes that no one accesses, empty-run nodes, and so on.

  • Governance item issues: the pending governance and optimization issues detected by governance items.

  • Governance solution template: a unified governance template provided by the Data Governance Center, preconfigured with common check items and governance items; you can use it directly to detect data problems.

  • Health score: computed from governance items using a model predefined by the system, and used to reflect governance effectiveness.

  • Governance unit: one or more workspaces grouped together to centrally report the overall health score, governance item issues, and check item events of the specified workspaces.

  • Knowledge base: solutions, provided by the Data Governance Center, for common check item events and governance item issues.
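As a toy illustration of a pre-development check item (hypothetical code; real check items are configured in the governance center rather than hand-written), submitted SQL can be scanned for forbidden patterns, producing blocking events on a match:

```python
# Toy pre-submit check item: block "select *" and raw "create table"
# statements, mimicking how a check item raises a blocking problem event.
# Hypothetical code; real check items are configured, not hand-written.
import re

CHECKS = {
    "no select *": re.compile(r"\bselect\s+\*", re.IGNORECASE),
    "no create table": re.compile(r"\bcreate\s+table\b", re.IGNORECASE),
}

def check_sql(sql: str) -> list[str]:
    """Return the names of violated check items (empty means pass)."""
    return [name for name, pattern in CHECKS.items() if pattern.search(sql)]

events = check_sql("SELECT * FROM ods_orders")
if events:
    print("submit blocked by check item events:", events)
```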

Data Governance Logic

As above, the checks before development are called check items, and the governance after development is called governance items.

Data governance problem detection includes check item detection before data development and governance item detection after data development, helping you manage all pending governance problems in your data. The Data Governance Center also provides a unified governance solution template that aggregates common check items and governance items; you can use this template directly to detect pending problems that arise while using the platform. The data governance logic is as follows.

  • Check item detection (pre-development, i.e., before data processing)

Used for control and governance before data development, mainly checking compliance when operations such as submit and publish are executed. Before developing data, you can use check items to verify the constraints related to data development features; when content that violates the constraints is found, the system generates a problem event that blocks the development flow. You can fix the exposed problems based on the event so that the development flow can proceed normally.

  • Governance item detection (post-development, i.e., after data processing)

Used for control and governance after data development. Once development is complete, you can use the governance features of the Data Governance Center to view pending governance items from the global, personal, and workspace perspectives. Data governance staff can quickly discover and resolve problems based on the exposed items and advance the team's data governance goals.
