Ideas for building an enterprise big data warehouse architecture based on Alibaba Cloud Data Plus and MaxCompute

Abstract: The SDA data live broadcast series, themed on ideas for building an enterprise big data warehouse based on Alibaba Cloud Data Plus and MaxCompute, shares how Alibaba's big data has evolved and how to use big data technology to build an enterprise-level big data platform. The guest speaker is Yixiu, a technical expert on Alibaba Cloud Big Data. Background and general idea: a data warehouse is a subject-oriented, integrated, non-volatile collection of data that reflects historical changes and supports management decisions.

Original link: http://click.aliyun.com/m/43803/

The SDA data live broadcast series focuses on ideas for building an enterprise big data warehouse based on Alibaba Cloud Data Plus and MaxCompute, sharing how Alibaba's big data has evolved and how to use big data technology to build an enterprise-level big data platform.

 

The guest for this session is Yixiu, a technical expert from Alibaba Cloud Big Data.

 

Background and general idea

 

A data warehouse is a subject-oriented, integrated, non-volatile collection of data that reflects historical changes to support management decisions. Its structure diagram is as follows:

 

[Figure: data warehouse structure diagram]

 

With the application and popularization of technologies such as big data and cloud computing, data processing in the Internet environment shows new characteristics: business changes rapidly, data sources are numerous, systems are tightly coupled, and applications go deeper. Faster business change brings more data sources. Previously, most data came from application databases such as Oracle and MySQL and was basically structured; in today's Internet environment there is far more data, such as website click logs, video, and voice, and all of it needs to be computed in a unified way to reflect the state of the business. Because systems in the Internet environment are heavily coupled, the key question is how to deepen data integration and increase application depth in such an environment. In terms of application depth, the earlier focus was on report analysis; in the big data environment, more algorithmic analysis is performed, and data models are built to predict and judge future trends. This places higher demands on the system:

 

Result data must be produced as quickly as possible;

Real-time requirements are increasing;

Ways to access and retrieve data must be diverse and convenient;

Security requirements are high.

 

Under these higher demands, traditional data warehouses inevitably face challenges: rapid data growth degrades processing efficiency; data integration costs are high; diverse data types cannot be handled; and in-depth analysis capabilities such as data mining are lacking. Given these characteristics, how should users build a big data warehouse? During the construction of Alibaba Cloud's data warehouse, the following four criteria were summarized:

 

Stable - data output is stable and guaranteed, and the system itself remains stable;

Credible - data is clean and of high enough quality to support more effective application services;

Rich - the data covers a broad enough scope of the business;

Transparent - the way the data is composed is transparent enough that users can trust it.

 

A complete big data warehouse should offer massive data storage and processing capacity, diverse programming interfaces and computing frameworks, rich data collection channels, and comprehensive security protection and monitoring. The architecture should therefore be built following certain design guidelines:

 

Combine top-down and bottom-up design, integrating data-driven and application-driven approaches;

Favor high fault tolerance in technology selection to ensure system stability;

Run data quality monitoring throughout the entire data processing process;

Do not fear data redundancy; trade storage for ease of use and for reduced complexity and computation.

 

Architecture and Model Design

 

[Figure: data warehouse construction process]

 

Generally speaking, building a data warehouse goes through the process shown above. A good architecture design meets the requirements well in terms of functional architecture, data architecture, and technical architecture:

 

[Figure: functional architecture example, showing a clear, layered structure]

[Figure: data architecture example, focusing on data flow and ensuring data quality]

[Figure: technical architecture example, easy to expand and easy to use]

 

The primary task of building a data warehouse is model design. There are two modeling methods generally used in the industry:

 

Dimensional modeling: simple structure; easy to analyze fact data; suitable for business analysis reports and BI.

Entity modeling: complex structure; makes it easy to connect subject-area data; suitable for deep mining of complex data.

 

Users can choose according to their actual situation; in practice, the star schema and the snowflake schema coexist in the data warehouse, which benefits data applications and reduces compute resource consumption.
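To make the difference concrete, here is a rough, hypothetical illustration (all table and field names are invented for this example, not taken from the talk): in a star schema the fact table points at flat, denormalized dimension tables, while a snowflake schema further normalizes the dimensions into sub-dimension tables.

    # Hypothetical order-analysis schemas, for illustration only.

    # Star schema: the fact table references flat, denormalized dimension tables.
    star_schema = {
        "fact_order":  ["order_id", "product_id", "buyer_id", "order_amount", "order_date"],
        "dim_product": ["product_id", "product_name", "category_name", "brand_name"],
        "dim_buyer":   ["buyer_id", "buyer_name", "city_name", "province_name"],
    }

    # Snowflake schema: dimensions are normalized into sub-dimension tables,
    # which saves storage but adds joins at query time.
    snowflake_schema = {
        "fact_order":   ["order_id", "product_id", "buyer_id", "order_amount", "order_date"],
        "dim_product":  ["product_id", "product_name", "category_id", "brand_id"],
        "dim_category": ["category_id", "category_name"],
        "dim_brand":    ["brand_id", "brand_name"],
        "dim_buyer":    ["buyer_id", "buyer_name", "city_id"],
        "dim_city":     ["city_id", "city_name", "province_name"],
    }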

 

For the layering of data processing, a three-layer structure is generally adopted, consisting of a basic data layer, a data middle layer, and a data mart layer:

 

[Figure: three-layer data processing structure]

 

This design compresses the length of the overall data processing pipeline. A flattened processing flow helps with data quality control and data operations and maintenance; making stream processing part of the data system puts more emphasis on data timeliness and makes the data more valuable.
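As a minimal sketch of such a flattened, three-layer flow (the layer prefixes ods/dw/dm and the processing logic below are common industry conventions assumed for illustration, not details given in the talk):

    # Illustrative three-layer processing flow with hypothetical data and names.
    # In practice each function would be a scheduled SQL or MapReduce job on the platform.

    def load_basic_layer(raw_records):
        """Basic data layer (e.g. ods_*): keep source data largely as-is, only cleaned."""
        return [r for r in raw_records if r.get("user_id") is not None]

    def build_middle_layer(ods_rows):
        """Data middle layer (e.g. dw_*): integrate sources around entities and behaviors."""
        behaviors_by_user = {}
        for row in ods_rows:
            behaviors_by_user.setdefault(row["user_id"], []).append(row["event"])
        return behaviors_by_user

    def build_mart_layer(dw_user_behaviors):
        """Data mart layer (e.g. dm_*): aggregate for one specific demand scenario."""
        return {uid: len(events) for uid, events in dw_user_behaviors.items()}

    if __name__ == "__main__":
        raw = [{"user_id": 1, "event": "click"}, {"user_id": 1, "event": "pay"},
               {"user_id": None, "event": "click"}, {"user_id": 2, "event": "click"}]
        print(build_mart_layer(build_middle_layer(load_basic_layer(raw))))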

 

Basic data layer

 

[Figure: basic data layer design]

 

Data middle layer

 

[Figure: data middle layer design]

 

Linking behaviors around entities integrates the data sources; relationships abstracted from behaviors become important data dependencies for future upper-layer applications. In addition, moderate redundancy is a good way to keep each subject area complete and make the data easier to use.

 

Data mart layer

 

[Figure: data mart layer design]

 

The data mart layer is driven by demand scenarios, and each mart is built vertically and independently. It must support rapid trial and error and deep mining of data value.

 

Building a Big Data Warehouse Based on Alibaba Cloud Data Plus

 

The entire business process of building a big data warehouse based on Alibaba Cloud Data Plus is as follows:

 

[Figure: overall business process of building a big data warehouse on Alibaba Cloud Data Plus]

 

Alibaba Cloud's Data Plus architecture is divided mainly into three layers: data integration, data system, and data application, as shown in the following figure:

 

[Figure: the Data Plus architecture, spanning data integration, data system, and data application]

 

Structured data collection usually involves both full collection and incremental collection. Full collection initializes the entire data warehouse by quickly synchronizing historical data to the computing platform; incremental collection handles synchronization after initialization. When the data volume is huge, when incremental synchronization consumes too many resources, or when downstream applications need quasi-real-time data, real-time collection is also used; this method places certain requirements on the source system, and its collection quality is the hardest to control.
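To make the distinction concrete, the following is a minimal, hypothetical sketch of full versus watermark-driven incremental synchronization; the table name, the date partition column ds, and the fetch functions are assumptions for illustration, not any specific product API.

    from datetime import date, timedelta

    def full_sync(fetch_all_rows, warehouse):
        """Full collection: initialize the warehouse with all historical data at once."""
        warehouse["ods_orders"] = list(fetch_all_rows())

    def incremental_sync(fetch_rows_for_ds, warehouse, last_synced_ds):
        """Incremental collection: pull only the partition after the last synced date."""
        next_ds = (date.fromisoformat(last_synced_ds) + timedelta(days=1)).isoformat()
        new_rows = fetch_rows_for_ds(next_ds)          # e.g. a source query filtered on ds = next_ds
        warehouse.setdefault("ods_orders", []).extend(new_rows)
        return next_ds                                 # new watermark, persisted by the scheduler

    if __name__ == "__main__":
        warehouse = {}
        full_sync(lambda: [{"order_id": 1, "ds": "2018-06-01"}], warehouse)
        incremental_sync(lambda ds: [{"order_id": 2, "ds": ds}], warehouse, "2018-06-01")
        print(warehouse)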

 

In fact, the more regular the original structure of a log, the lower the parsing cost. Before logs are collected onto the platform, it is recommended not to structure them at the source; instead, implement log structuring on the platform through UDFs or the MR computing framework.
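As an illustration of structuring logs on the platform rather than at the source, here is a minimal sketch written against MaxCompute's Python UDF interface (odps.udf); the pipe-separated log layout and the field list are assumptions for this example.

    # Sketch of a log-parsing UDF, assuming the MaxCompute Python UDF interface.
    from odps.udf import annotate

    @annotate("string->string")
    class ParseAccessLog(object):
        """Turn a raw log line 'ip|timestamp|url|status' into tab-separated fields."""

        def evaluate(self, line):
            if line is None:
                return None
            parts = line.split("|")
            if len(parts) != 4:
                return None          # malformed lines are dropped (or routed to an error table)
            ip, ts, url, status = parts
            return "\t".join([ip, ts, url, status])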

 

Correspondence between data warehouse and Alibaba Cloud Data Plus products

 

[Figure: correspondence between data warehouse components and Alibaba Cloud Data Plus products]

 

Offline warehouse: the security of MaxCompute data sharing

 

Warehouse security is the most important topic. The multi-tenant data authorization model based on MaxCompute provides a very secure data-sharing mechanism and effectively controls data flow and access.
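MaxCompute expresses this through project-level permissions and explicit cross-project grants; the snippet below is only a conceptual sketch of such a multi-tenant authorization check (all names and the grant structure are invented for illustration and are not the MaxCompute API).

    # Conceptual sketch of multi-tenant data authorization (not the MaxCompute API).
    # Each owning project grants specific privileges on specific tables to other tenants.

    grants = {
        # (owning_project, table): {grantee: set_of_privileges}
        ("prj_trade", "dwd_order"): {"prj_bi": {"SELECT"}},
        ("prj_trade", "ods_raw"):   {},          # raw data is not shared at all
    }

    def is_allowed(grantee, project, table, privilege):
        """Return True only if the owning project explicitly granted this privilege."""
        return privilege in grants.get((project, table), {}).get(grantee, set())

    if __name__ == "__main__":
        print(is_allowed("prj_bi", "prj_trade", "dwd_order", "SELECT"))   # True
        print(is_allowed("prj_bi", "prj_trade", "ods_raw", "SELECT"))     # False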

 

[Figure: MaxCompute multi-tenant data authorization model]

 

Some Best Practices in Architecture Design

 

[Figure: data table naming convention]

[Figure: partition table and workflow design]

[Figure: applying computing frameworks and optimizing critical paths]

 

Some helpful cases from actual development

 

[Figures: case examples from actual development]

 

Governing Big Data with Big Data

 

Data governance covers security mechanisms, management, and content construction, and runs through the entire data development process:

 

[Figure: data governance throughout the data development process]

 

To measure the effect of data governance, Alibaba Cloud uses a data management health assessment system that accurately reflects the health of data management and produces data management health scores.

 

[Figure: data management health assessment system]

 

An especially important part of data governance is dealing with duplicate data, which shows up in many forms (a small detection sketch follows the list):

 

Same source: the same table is extracted repeatedly;

Similar computation: the tables read are the same and the processing logic is similar;

Simple processing: a trivial transformation or slice is saved into a new table;

Same table, same partition: the data is never updated or the business has stopped;

Empty tables: the job's result data stays empty;

Similar naming: table names or field names are highly similar;

Special rules: identified through known business rules.
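Several of these checks are easy to automate against the platform's metadata. Below is a minimal, self-contained sketch of the "similar naming" and "empty table" checks; the metadata shape and the 0.9 similarity threshold are assumptions, not Alibaba's actual governance rules.

    import difflib

    def find_similar_names(table_names, threshold=0.9):
        """Flag table-name pairs whose names are highly similar (possible duplicates)."""
        suspects = []
        for i, name_a in enumerate(table_names):
            for name_b in table_names[i + 1:]:
                ratio = difflib.SequenceMatcher(None, name_a, name_b).ratio()
                if ratio >= threshold:
                    suspects.append((name_a, name_b, round(ratio, 2)))
        return suspects

    def find_empty_tables(table_row_counts):
        """Flag tables whose output data has stayed empty."""
        return [name for name, rows in table_row_counts.items() if rows == 0]

    if __name__ == "__main__":
        names = ["dm_trade_order_detail", "dm_trade_order_detail_bak", "dw_user_profile"]
        print(find_similar_names(names))
        print(find_empty_tables({"dm_tmp_result": 0, "dw_user_profile": 120000}))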

 

Data Quality Management System

 

[Figure: data quality management system]

 

Data Lifecycle Management

 

[Figure: data lifecycle management]

 

Summary: Alibaba's Big Data Practice Road

 

[Figure: Alibaba's big data practice road]
