Abstract: The SDA data live-broadcast course series, themed on ideas for building an enterprise big data warehouse with Alibaba Cloud Data Plus and MaxCompute, shares how Alibaba's big data has evolved and how to use big data technology to build an enterprise-level big data platform. The guest of this sharing is Yixiu, a technical expert from Alibaba Cloud Big Data.
Original link: http://click.aliyun.com/m/43803/
The SDA data live-broadcast series mainly focuses on ideas for building an enterprise big data warehouse based on Alibaba Cloud Data Plus and MaxCompute, sharing how Alibaba's big data has evolved and how to use big data technology to build an enterprise-level big data platform.
The guest of this sharing is Yixiu, a technical expert from Alibaba Cloud Big Data!
Background and general idea
A data warehouse is a subject-oriented, integrated, non-volatile collection of data that reflects historical changes to support management decisions. Its structure diagram is as follows:
With the application and spread of technologies such as big data and cloud computing, data processing in the Internet environment shows new characteristics: business changes rapidly, data sources are numerous, systems are heavily coupled, and applications go deeper. Faster business change drives growth in data sources. Previously, most data came from application databases and was largely structured, e.g. from Oracle or MySQL. In today's Internet environment there is far more data, such as website click logs, video, and voice; all of it must be computed in a unified way to reflect the state of the business. Because systems in the Internet environment are heavily coupled, the key question is how to deepen data integration and application depth in such an environment. In terms of application depth, the earlier focus was report analysis; in the big data environment, more algorithmic analysis is performed, with data models built to predict and judge future trends. This raises the bar for the system:
Require the result data to be obtained as quickly as possible;
Increased real-time demand;
Diverse and convenient ways to access and retrieve data;
High security requirements.
Under such high demands, traditional warehouses inevitably face challenges: rapid data growth degrades operating efficiency; data integration is costly; diverse data cannot be handled; deep analysis capabilities such as data mining are lacking. Given these characteristics, how should users build a big data warehouse? During the construction of Alibaba Cloud's data warehouse, the following four metrics were distilled:
Stability - data output is stable and guaranteed, and system stability is maintained;
Credibility - data is clean and of high enough quality to support more efficient application services;
Richness - the business scope covered by the data is broad enough;
Transparency - the system that produces the data is transparent enough for users to trust it.
A complete big data warehouse should offer massive data storage and processing capacity, diverse programming interfaces and computing frameworks, rich data collection channels, and various security protections and monitoring. Its architecture should therefore follow certain design guidelines:
Top-down + bottom-up design, data-driven and application-driven integration;
Pay attention to high fault tolerance in technology selection to ensure system stability;
Data quality monitoring runs through the entire data processing process;
Do not fear data redundancy; make full use of storage, trading it for ease of use and for reduced complexity and computation.
Architecture and Model Design
Generally speaking, building a data warehouse goes through the processes shown above. A good architecture design meets the requirements of the functional architecture, the data architecture, and the technical architecture:
Functional architecture example: clear hierarchical structure
Data architecture example: focus on data flow and ensure data quality
Technical architecture example: easy to expand, easy to use
The primary task of building a data warehouse is model design. There are two modeling methods generally used in the industry:
Dimensional modeling: simple structure; easy to analyze fact data; suitable for business analysis reports and BI.
Entity modeling: complex structure; easy to link subject-area data; suitable for deep mining of complex data content.
Users can choose according to their actual situation; in practice, star schemas and snowflake schemas coexist in the warehouse, which benefits data applications and reduces computing resource consumption.
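To make the star-schema idea above concrete, here is a minimal sketch in plain Python (standing in for the warehouse's SQL): one fact table is denormalized against two dimension tables by surrogate key. All table contents and column names are invented for illustration.

```python
# Star-schema sketch: a fact table joined to two dimension tables by
# surrogate key. All data here is invented for illustration.

dim_date = {1: {"date": "2018-01-01"}, 2: {"date": "2018-01-02"}}
dim_product = {10: {"name": "widget", "category": "hardware"}}

fact_sales = [
    {"date_key": 1, "product_key": 10, "amount": 30.0},
    {"date_key": 2, "product_key": 10, "amount": 45.0},
]

def star_join(fact_rows, dims):
    """Denormalize each fact row by looking up its dimension attributes."""
    for row in fact_rows:
        out = dict(row)
        for key_col, dim in dims.items():
            out.update(dim[row[key_col]])
        yield out

rows = list(star_join(fact_sales,
                      {"date_key": dim_date, "product_key": dim_product}))
total = sum(r["amount"] for r in rows if r["category"] == "hardware")
```

A snowflake schema would further normalize `dim_product` (e.g. splitting `category` into its own table), trading the simplicity of this join for less redundancy.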
For the layering of data processing, a three-layer structure is generally used:
This design compresses the overall length of the data processing pipeline. A flattened process helps with data quality control and data operations and maintenance; making stream processing part of the data system puts more attention on data timeliness and raises the value of the data.
Basic data layer
Data middle tier
Connecting behaviors around entities integrates the data sources; the relationships abstracted from those behaviors become a very important data asset that upper-layer applications will depend on. In addition, redundancy is a good way to keep subjects complete and improve the ease of use of the data.
Data mart layer
The data mart layer is driven by demand scenarios, with each mart built vertically and independently of the others. It must support rapid trial and error and deep mining of the value of data.
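The three layers above can be sketched as a toy pipeline: the basic data layer cleanses at source granularity, the data middle tier integrates behaviors around the user entity, and the data mart layer computes one demand-driven metric. The field names, cleansing rule, and metric are all invented for illustration.

```python
# Hypothetical three-layer pipeline; field names and rules are invented.
from datetime import datetime

raw_events = [
    {"user": "u1", "action": "click", "ts": "2018-01-01 10:00:00"},
    {"user": "u1", "action": "click", "ts": "bad-timestamp"},
    {"user": "u2", "action": "buy",   "ts": "2018-01-01 11:00:00"},
]

def basic_layer(rows):
    """Keep source granularity; drop records failing basic quality checks."""
    for r in rows:
        try:
            datetime.strptime(r["ts"], "%Y-%m-%d %H:%M:%S")
            yield r
        except ValueError:
            pass  # quality monitoring would count these rejects

def middle_tier(rows):
    """Integrate behaviors around the user entity."""
    by_user = {}
    for r in rows:
        by_user.setdefault(r["user"], []).append(r["action"])
    return by_user

def mart_layer(by_user):
    """One vertical, demand-driven metric: users who completed a purchase."""
    return sorted(u for u, actions in by_user.items() if "buy" in actions)

buyers = mart_layer(middle_tier(basic_layer(raw_events)))
```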
Building a Big Data Warehouse Based on Alibaba Cloud Data Plus
The entire business process of building a big data warehouse based on Alibaba Cloud Data Plus is as follows:
Alibaba Cloud's data plus architecture is mainly divided into three levels: data integration, data system, and data application, as shown in the following figure:
Structured data collection usually involves full collection and incremental collection. Full collection initializes the entire data warehouse, quickly synchronizing historical data to the computing platform; incremental collection is the data synchronization done after initialization. When the data volume is huge, when incremental synchronization consumes too many resources, or when downstream applications need near-real-time data, real-time collection is also used. That method places certain requirements on the source system, and its collection quality is the hardest to control.
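The full-versus-incremental distinction can be sketched with a modification-time watermark; the source rows, the `updated_at` column, and the bookmark value are assumptions for illustration, not a real synchronization API.

```python
# Sketch of full vs. incremental collection using a watermark column.
# Source rows and the watermark value are invented for illustration.

source_rows = [
    {"id": 1, "updated_at": "2018-01-01"},
    {"id": 2, "updated_at": "2018-01-02"},
    {"id": 3, "updated_at": "2018-01-03"},
]

def full_load(rows):
    """Initialization: synchronize all historical data."""
    return list(rows)

def incremental_load(rows, last_watermark):
    """Post-initialization: only rows changed after the last sync."""
    return [r for r in rows if r["updated_at"] > last_watermark]

warehouse = full_load(source_rows)                    # first run
delta = incremental_load(source_rows, "2018-01-02")   # later run
```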
In fact, the more standardized the original structure of a log, the lower the cost of parsing it. It is recommended not to structure logs before they are collected onto the platform; instead, implement log structuring on the platform through UDFs or the MapReduce computing framework.
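As one sketch of structuring raw logs on the platform, in the spirit of a Python UDF: the log format, the regular expression, and the field names below are assumptions, not a real MaxCompute API.

```python
# Sketch of a log-structuring function; the access-log format and
# field names are invented for illustration.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d+)'
)

def parse_log_line(line):
    """Turn one raw access-log line into structured columns (or None)."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

rec = parse_log_line('1.2.3.4 - [01/Jan/2018:10:00:00] "GET /index.html" 200')
```

A more standardized log format would let this pattern stay simple; messy formats push the parsing cost into ever more elaborate rules.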
Correspondence between data warehouse and Alibaba Cloud Data Plus products
Offline warehouse: the security of MaxCompute data sharing
Security is the most important topic for a warehouse. MaxCompute's multi-tenant data authorization model provides a very secure data sharing mechanism and can effectively control data flow and restrict access.
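As a toy illustration of controlled multi-tenant sharing: the grant semantics below are invented and far simpler than MaxCompute's actual project-level authorization model; they only show the idea of explicit, per-tenant grants.

```python
# Toy multi-tenant ACL; grant semantics are invented for illustration.

class ProjectACL:
    """Per-project grants: (tenant, table) -> set of allowed actions."""
    def __init__(self):
        self._grants = {}

    def grant(self, tenant, table, action):
        self._grants.setdefault((tenant, table), set()).add(action)

    def allowed(self, tenant, table, action):
        # Default-deny: anything not explicitly granted is refused.
        return action in self._grants.get((tenant, table), set())

acl = ProjectACL()
acl.grant("tenant_b", "dwd_user_click_log", "select")
can_read = acl.allowed("tenant_b", "dwd_user_click_log", "select")
can_write = acl.allowed("tenant_b", "dwd_user_click_log", "insert")
```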
Some Best Practices in Architecture Design
Data table naming convention
Partition table, workflow design
Computing framework applications, optimizing critical paths
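As one illustration of the naming-convention practice above, a small checker; the layer prefixes (`ods`/`dwd`/`dws`/`adm`) and the pattern are illustrative assumptions, not Alibaba's published standard.

```python
# Hypothetical table-naming convention checker; the prefixes and the
# pattern are illustrative assumptions.
import re

NAME_RULE = re.compile(r"^(ods|dwd|dws|adm)_[a-z0-9]+(_[a-z0-9]+)*$")

def check_table_name(name):
    """Accept lowercase, underscore-separated names with a layer prefix."""
    return bool(NAME_RULE.match(name))

ok = check_table_name("dwd_user_click_log")
bad = check_table_name("MyTable1")
```

A convention like this makes the layer of every table visible at a glance, which in turn makes workflow design and critical-path optimization easier to reason about.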
Some friendly cases in actual development
Governing Big Data with Big Data
Data governance is divided into security mechanisms, management, and content construction, and runs through the entire process of data development:
To effectively measure the effect of data governance, Alibaba Cloud uses a data management health assessment system that accurately gauges the health of data management and produces a data management health score.
A particularly important part of data governance is duplicate data governance, which takes many forms:
Same source: the same table is pulled in repeatedly;
Computationally similar: read tables are the same and processing features are similar;
Simple processing: simple conversion, cutting and saving to a new table;
Same table and same partition: the data is no longer updated or the business has stopped;
Empty result tables: a job's result data remains empty;
Similar naming: table names or field names are highly similar;
Special Rules: Identified by known business rules.
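One of the signals above, similar naming, can be sketched as follows; the similarity threshold is an assumption, and real governance would combine several of the listed signals rather than rely on names alone.

```python
# Sketch of the "similar naming" duplicate-data signal; the 0.9
# threshold is an assumption for illustration.
from difflib import SequenceMatcher
from itertools import combinations

def similar_name_pairs(tables, threshold=0.9):
    """Flag pairs of table names whose similarity ratio exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(tables), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

pairs = similar_name_pairs(
    ["dwd_user_click_log", "dwd_user_click_logs", "adm_sales_daily"]
)
```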
Data Quality Management System
Data Lifecycle Management
Summary: Alibaba's Big Data Practice Road