Introduction to Big Data Hive Data Warehouse

Table of contents

1. Data warehouse concept

2. Scenario case: Why did the data warehouse come about? 

2.1 Preservation of operational records 

2.2 Analytical decision making 

2.3 Is analysis feasible in an OLTP environment? 

2.4 Construction of data warehouse  

3. Main features of data warehouse 

3.1 Subject-Oriented  

3.2 Integrated 

3.3 Non-Volatile 

3.4 Time-Variant 

4. Data warehouse, database, data mart 

4.1 OLTP and OLAP 

4.2 Data warehouse and database  

4.3 Data warehouse and data mart 

5. Data warehouse hierarchical architecture 

5.1 Data warehouse layering ideas and standards 

5.2 Alibaba Data Warehouse 3-tier architecture 

5.2.1 ODS layer (Operation Data Store) 

5.2.2 DW layer (Data Warehouse) 

5.2.3 DA layer (or ADS layer) 

5.3 Benefits of layering 

5.4 ETL and ELT 

5.4.1 Background 

5.4.2 Concept 

6. Scenario analysis: Meituan-Dianping's hotel and travel data warehouse construction practice 

6.1 Architecture changes 

6.2 Theme construction 

6.3 Overall architecture 


 

1. Data warehouse concept

        A data warehouse (Data Warehouse, DW) is a data system used for storage, analysis, and reporting. Its purpose is to build an integrated, analysis-oriented data environment and provide decision support (Decision Support) for the enterprise.

        The data warehouse itself does not "produce" any data; its data comes from different external systems. At the same time, the data warehouse itself does not need to "consume" any data; its results are open to various external applications. This is why it is called a "warehouse" rather than a "factory".

2. Scenario case: Why did the data warehouse come about? 

        Let's state the conclusion first: the data warehouse exists so that we can analyze data, and the analysis results provide support for corporate decision-making. In business, information is always used for two purposes:

(1) operational record keeping, and (2) analytical decision-making.

Let's take the development of China Life Insurance Company (chinalife) as an example to explain why the data warehouse came about.

2.1  Preservation of operational records 

        China Life Insurance (Group) Company runs multiple business lines, including life insurance, property insurance, auto insurance, and pension insurance. The normal operation of each business line requires maintaining records of customers, policies, premium collection and payment, underwriting, claims, and other information.

        An online transaction processing (OLTP) system meets exactly these business needs; its main task is to perform online transaction processing. Its basic characteristic is that data received at the front end can be passed immediately to the back end for processing, and the result is returned within a very short time. Relational databases (RDBMS) such as Oracle, MySQL, and SQL Server are the typical OLTP systems.

2.2  Analytical decision making 

        As the group's business continues to operate, business data keeps accumulating. This also raises many operational questions: Can we determine which insurance lines are deteriorating or have become non-performing? Can new-business and renewal policies be formulated more effectively? Is there a possibility of fraud in the claims process? Are the reports we receive limited to a single business line, and what do the numbers look like at the overall group level?

        To understand these problems correctly and formulate solutions, decisions cannot be made on gut feeling alone. The safest approach is to analyze the business data and base decisions on the analysis results. This is called data-driven decision making.

 

This leads to the next question: where should the data analysis be performed? Can it be done directly in the database?

2.3  Is it feasible to conduct analysis in an OLTP  environment? 

Technically yes, but it is not advisable.

        The core of the OLTP system is to face the business, support the business, and support transactions. All business operations can be divided into read and write operations, and generally speaking the read pressure is significantly greater than the write pressure. If various analyses are performed directly in the OLTP environment, the following issues must be considered:

  • Data analysis also performs reading operations on data, which will double the reading pressure;
  • OLTP  stores only weeks or months of data;
  • The data is scattered in different tables in different systems, and the field type attributes are not uniform;

        When the amount of data involved in the analysis is small, analysis can be carried out directly on the OLTP system during off-peak business hours. However, to support data analysis of all scales without affecting the operation of the OLTP system, it is necessary to build an integrated, unified data analysis platform.

        The purpose of this platform is simple: it is analytics-oriented, supports analytics , and is  decoupled from OLTP  systems. Based on this demand, the prototype of data warehouse began to appear in enterprises.

2.4  Construction of data warehouse  

        As the definition says, a data warehouse is a data system used for storage, analysis, and reporting, whose purpose is to build an integrated data environment for analysis. We call this analysis-oriented, analysis-supporting system an OLAP (Online Analytical Processing) system; the data warehouse is one type of OLAP system.

China Life Insurance Company can build a data warehouse platform based on analysis and decision-making needs.

3. Main features of data warehouse 

        The purpose of the data warehouse is to build an integrated data environment for analysis, with the analysis results providing decision support (Decision Support) for the enterprise. The data warehouse itself does not "produce" any data; its data comes from different external systems. At the same time, the data warehouse itself does not need to "consume" any data; its results are open to various external applications.

3.1 Subject-Oriented 

        The biggest characteristic of a database is that data is organized around applications, and each business system may be isolated from the others. A data warehouse, by contrast, is subject-oriented. A subject is an abstract concept: a higher-level abstraction for synthesizing, classifying, analyzing, and utilizing the data in an enterprise's information systems. Logically, it corresponds to the analysis objects involved in a particular macro-level analysis domain of the enterprise.

        The way operational processing (traditional data) segments data is not suitable for decision analysis. Data organized by subject is different: it is divided into independent domains, each with its own logical connotation and no overlap with the others, providing a complete, consistent, and accurate description of the data at an abstract level.

3.2 Integrated 

        After determining a subject, you need to obtain the data related to it. In today's enterprises, subject-related data is usually distributed across multiple operational systems and is dispersed, independent, and heterogeneous.

        Therefore, before the data enters the data warehouse, it must be integrated and unified: extracted, cleaned, transformed, and summarized. This is the most critical and complex step in building the data warehouse. The tasks to be completed are:

  1. Unify all contradictions in the source data, such as homonyms and synonyms among field names, inconsistent units, inconsistent field lengths, and so on.
  2. Perform data aggregation and calculation. Aggregated data in the data warehouse can be generated when data is extracted from the original databases, but much of it is generated inside the data warehouse, that is, after the data has entered the warehouse.

        The diagram below illustrates a simple process for an insurance company's consolidated data, where the data related to the topic "underwriting" in the data warehouse comes from several different operational systems. The naming of data within these systems may be different, and the data format may also be different. These inconsistencies need to be removed before storing data from different sources into the data warehouse. 
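        As a minimal HiveQL sketch of this kind of integration (all table and field names here are hypothetical, not taken from an actual chinalife system), suppose two source systems record policies with different field names, code values, and amount units; before loading, they are unified into one underwriting table:

```sql
-- Hypothetical sketch: two source systems describe the same policy data
-- with different field names, code values, and amount units. Before loading
-- into the warehouse, everything is unified into one consistent table.
CREATE TABLE IF NOT EXISTS dw.underwriting_detail (
    policy_id      STRING,
    customer_name  STRING,
    gender         STRING,          -- unified to 'M' / 'F'
    premium_cny    DECIMAL(16,2),   -- unified to yuan (CNY)
    source_system  STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

INSERT OVERWRITE TABLE dw.underwriting_detail PARTITION (dt = '2024-01-01')
SELECT policy_no             AS policy_id,
       cust_nm               AS customer_name,
       CASE sex WHEN '1' THEN 'M' WHEN '2' THEN 'F' END AS gender,
       premium               AS premium_cny,          -- already in yuan
       'life_insurance_sys'  AS source_system
FROM   ods.life_policy
UNION ALL
SELECT policy_id,
       customer_name,
       gender,                                         -- already 'M' / 'F'
       premium_fen / 100.0   AS premium_cny,           -- unit conversion: fen -> yuan
       'auto_insurance_sys'  AS source_system
FROM   ods.auto_policy;
```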

3.3 Non-Volatile 

        A data warehouse is a platform for analyzing data, not a platform for creating data . We use the data warehouse to analyze the patterns in the data, rather than create and modify the patterns. So once the data enters the data warehouse, it is stable and does not change.

        Operational databases mainly serve daily business operations, which requires the database to continuously update data in real time in order to quickly obtain the latest data without affecting normal business operations. In the data warehouse, as long as past business data is saved, there is no need to update the data warehouse in real time for every business. Instead, a batch of newer data is imported into the data warehouse at regular intervals according to business needs.
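        A minimal HiveQL sketch of this batch-oriented loading, assuming a warehouse table partitioned by load date (the table and column names are hypothetical):

```sql
-- Hypothetical sketch: instead of updating rows in place, a new batch of
-- business data is written into the warehouse as a dated partition,
-- e.g. once per day. Historical partitions are left untouched.
INSERT OVERWRITE TABLE dw.policy_snapshot PARTITION (dt = '2024-01-02')
SELECT policy_id, customer_id, status, premium_cny
FROM   ods.policy
WHERE  dt = '2024-01-02';
```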

        The data in the data warehouse reflects the content of historical data over a long period of time . It is a collection of database snapshots at different points in time, as well as exported data based on statistics, synthesis and reorganization of these snapshots.

        Most of the data operations performed by data warehouse users are data queries or relatively complex mining. Once the data enters the data warehouse, it is generally retained for a long time. There are generally a large number of query operations in data warehouses, but very few modification and deletion operations .

3.4 Time-Variant 

        Data warehouses contain historical data at various granularities, which may be related to a specific date, week, month, quarter, or year. Although users of the data warehouse cannot modify the data, it does not mean that the data in the data warehouse will never change.

        The results of the analysis can only reflect the past situation. When the business changes, the patterns discovered will lose their timeliness. Therefore, the data in the data warehouse needs to be updated over time to meet the needs of decision-making . From this perspective, data warehouse construction is not only a project, but also a process.

The changes in data warehouse data over time are reflected in the following aspects:

  • The time horizon of data warehouse data is generally much longer than that of operational data.
  • Operational systems store current data, while data in data warehouses are historical data.
  • The data in the data warehouse are appended in chronological order, and they all have time attributes.

4. Data warehouse, database, data mart 

4.1 OLTP and OLAP 

        Operational processing, known as Online Transaction Processing (OLTP), has data processing as its main goal. It consists of routine, online database operations for specific business activities, usually querying and modifying a small number of records. Users care most about response time, data security, integrity, and the number of concurrent users supported. Traditional relational database systems (RDBMS), the main means of data management, are used primarily for operational processing.

        Analytical processing is called online analytical processing OLAP ( On-Line Analytical Processing ), and its main goal is data analysis . Generally, complex multi-dimensional analysis is performed on historical data on certain topics to support management decisions . Data warehouse is  a typical example of OLAP  system , mainly used for data analysis.

Let's compare OLTP and OLAP from several different angles:

|  | OLTP | OLAP |
| --- | --- | --- |
| Data source | Contains only current, daily business data | Integrates data from multiple sources, including OLTP and external sources |
| Purpose | Application-oriented, business-oriented, supporting transactions | Subject-oriented, analysis-oriented, supporting analysis and decision-making |
| Focus | The present | Mainly the past and history (real-time data warehouses are the exception) |
| Task | Read and write operations | Many reads, few writes |
| Response time | Milliseconds | Seconds, minutes, hours, or days, depending on data volume and query complexity |
| Data volume | Small data: MB, GB | Big data: TB, PB |

4.2  Data warehouse and database  

        The difference between a database and a data warehouse is essentially the difference between OLTP and OLAP. The typical OLTP application is the RDBMS, which is what we commonly call a database; note that "database" here means a relational database, and NoSQL databases are outside the scope of this discussion. The typical OLAP application is the DW, commonly known as the data warehouse.

  • The data warehouse is not a large database, although the data warehouse stores data on a large scale.
  • The emergence of data warehouse is not to replace the database.
  • The database is transaction-oriented and the data warehouse is subject-oriented.
  • Databases generally store business data, while data warehouses generally store historical data.
  • Databases are designed to capture data, and data warehouses are designed to analyze data.

4.3  Data warehouse and data mart 

        The data warehouse (Data Warehouse) serves the data of the entire organization, while the data mart (Data Mart) serves a single department. A data mart can be considered a subset of a data warehouse; some people call a data mart a small data warehouse. Data marts typically address only one subject area, such as marketing or sales. Because they are smaller and more focused, they are generally easier to manage and maintain and have a more flexible structure.

        In the figure below, data from various operational systems and other sources (including files) is loaded into the data warehouse through ETL (extraction, transformation, and loading). The data warehouse holds data for different subjects, and data marts are built per department, oriented to specific subjects such as Purchasing, Sales, and Inventory. Users can then carry out various applications based on the subject data: data analysis, data reporting, and data mining.
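        A minimal HiveQL sketch of a department-level data mart built as a subset of the warehouse (the schema and table names are assumed for illustration):

```sql
-- Hypothetical sketch: a data mart as a department-level subset of the
-- warehouse. The Sales mart keeps only the sales subject at the grain the
-- sales team needs; the warehouse table referenced here is assumed.
CREATE TABLE IF NOT EXISTS mart_sales.daily_sales
STORED AS ORC AS
SELECT dt,
       region,
       product_line,
       SUM(order_amount)        AS total_amount,
       COUNT(DISTINCT order_id) AS order_cnt
FROM   dw.order_detail
GROUP  BY dt, region, product_line;
```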

5. Data warehouse hierarchical architecture 

5.1 Data warehouse layering ideas and standards 

        The characteristic of the data warehouse is that it neither produces data itself nor ultimately consumes it, so layering the warehouse according to how data flows in and out is a natural approach. Each enterprise can define its own layers according to its business needs, but the most basic layering divides the warehouse into three layers: the operational data layer (ODS), the data warehouse layer (DW), and the data application layer (DA). In practice, enterprises can add new layers on top of this basic scheme to meet different business needs.

5.2 Alibaba Data Warehouse 3-tier architecture 

        To better understand the idea of data warehouse layering and the role of each layer, the following analysis is based on the layered architecture diagram provided by Alibaba. The Alibaba data warehouse has a classic 3-layer architecture, from bottom to top: ODS, DW, and DA. Through metadata management and data quality monitoring, the flow, dependencies, and life cycle of data across the entire warehouse can be controlled. We will not discuss this in depth for now; the goal is only a macro-level understanding.

5.2.1  ODS  layer ( Operation Data Store ) 

        The operational data layer is also called the source data layer, data introduction layer, data staging layer, or temporary cache layer. This layer stores the raw, unprocessed data brought into the data warehouse system; it is structurally consistent with the source systems and serves as the data preparation area of the warehouse. Its main responsibilities are to bring basic data into the data warehouse, decouple the warehouse from the source systems, and record historical changes in the basic data.
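        A minimal HiveQL sketch of an ODS table, assuming the raw data arrives as tab-separated files exported from the business database (all names and paths are hypothetical):

```sql
-- Hypothetical sketch of an ODS table: it mirrors the structure of the
-- source system table and only adds a load-date partition, with no
-- cleaning or transformation applied at this layer.
CREATE TABLE IF NOT EXISTS ods.user_order (
    order_id     STRING,
    user_id      STRING,
    order_status STRING,
    order_amount DECIMAL(16,2),
    create_time  STRING
)
PARTITIONED BY (dt STRING)          -- load date, used to record history
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Raw files exported from the business database (e.g. by Sqoop or DataX)
-- are simply loaded into the partition for that day.
LOAD DATA INPATH '/warehouse/staging/user_order/2024-01-01'
OVERWRITE INTO TABLE ods.user_order PARTITION (dt = '2024-01-01');
```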

5.2.2  DW  layer ( Data Warehouse ) 

        The data warehouse layer is built from the ODS layer data. It mainly completes data processing and integration: it establishes consistent dimensions, builds reusable detailed fact tables for analysis and statistics, and summarizes indicators at public granularity. Its internal sub-layers are as follows (a brief HiveQL sketch follows the list):

  • Common Dimension Layer ( DIM ): Based on the dimensional modeling concept, the consistent dimension of the entire enterprise is established.
  • Public summary granularity fact layer (DWS, DWB): driven by the analysis subjects and based on the indicator requirements of upper-layer applications and products, builds public-granularity summary indicator fact tables, physicalized as wide tables.
  • Detailed-grained fact layer ( DWD ) : Make certain important dimension attribute fields of the detailed fact table appropriately redundant, that is, wide table processing.
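        The following HiveQL sketch illustrates these DW sub-layers under assumed table names: a DWD detail fact table widened with redundant dimension attributes, and a DWS summary table at public granularity:

```sql
-- Hypothetical sketch of the DW layer. The DWD table is a detailed fact
-- table widened with redundant dimension attributes; the DWS table
-- aggregates it into a public-granularity summary. All names are assumed.

-- DWD: detail-grained fact, joined with dimensions and stored as a wide table
INSERT OVERWRITE TABLE dwd.order_detail PARTITION (dt = '2024-01-01')
SELECT o.order_id,
       o.user_id,
       u.user_level,          -- redundant dimension attribute
       o.product_id,
       p.category_name,       -- redundant dimension attribute
       o.order_amount,
       o.create_time
FROM   ods.user_order o
LEFT JOIN dim.user    u ON o.user_id    = u.user_id
LEFT JOIN dim.product p ON o.product_id = p.product_id
WHERE  o.dt = '2024-01-01';

-- DWS: public summary at (day, category) granularity, reusable by many reports
INSERT OVERWRITE TABLE dws.order_category_1d PARTITION (dt = '2024-01-01')
SELECT category_name,
       SUM(order_amount)        AS gmv,
       COUNT(DISTINCT order_id) AS order_cnt,
       COUNT(DISTINCT user_id)  AS buyer_cnt
FROM   dwd.order_detail
WHERE  dt = '2024-01-01'
GROUP  BY category_name;
```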

5.2.3  DA  layer (or ADS layer) 

        The data application layer is oriented to end users and the business, providing customized data for products and data analysis, including front-end reports, analysis charts, KPIs, dashboards, OLAP subjects, data mining, and other analyses.
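        A minimal sketch of a DA/ADS table, building on the assumed DWS table from the previous sketch; the result is small and report-ready so that BI tools can read it directly:

```sql
-- Hypothetical sketch of a DA/ADS table: a small, report-ready result that
-- a dashboard or BI tool can read directly.
INSERT OVERWRITE TABLE ads.category_gmv_rank PARTITION (dt = '2024-01-01')
SELECT category_name,
       gmv,
       RANK() OVER (ORDER BY gmv DESC) AS gmv_rank
FROM   dws.order_category_1d
WHERE  dt = '2024-01-01';
```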

5.3  Benefits of layering 

        The main reason for layering is to have a clearer control over the data when managing data. In detail, there are mainly the following reasons:

  • Clear data structure

Each data layer has its scope, which makes it easier to locate and understand when using tables.

  • Data lineage tracking

        To put it simply, what we finally present to the business is a table that can be used directly, but it is built from many sources. If there is a problem with one of the source tables, we want to be able to quickly and accurately locate the problem and understand its scope of impact.

  • Reduce duplication of development

Standardizing data stratification and developing some common middle-tier data can reduce huge repetitive calculations.

  • Simplify complex problems

        Decompose a complex task into multiple steps to complete. Each layer only handles a single step, which is relatively simple and easy to understand. It also makes it easier to maintain data accuracy. When a problem occurs with the data, you don’t need to repair all the data. You only need to start repairing it from the problematic step.

  • Mask exceptions in raw data

It shields downstream layers from changes in the source business systems, so data does not have to be re-ingested every time the business changes.

5.4 ETL and ELT 

5.4.1 Background 

        The data warehouse obtains data from various data sources, and the transformation and flow of data within the warehouse can be regarded as an ETL (Extract, Transform, Load) process. In practice, however, there are two different ways of loading data into the warehouse: ETL and ELT.

5.4.2 Concept 

  • Extract → Transform → Load (ETL)

        Data is first extracted from a pool of data sources , which are typically transactional databases. The data is kept in a temporary staging database ( ODS ). Transformation operations are then performed to structure and transform the data into a form suitable for the target data warehouse system. The structured data is then loaded into the warehouse ready for analysis.

  • Extract → Load → Transform (ELT)

        With ELT, data is loaded immediately after being extracted from the source data pool. There is no dedicated staging database (ODS), which means the data is loaded directly into a single, centralized repository and then transformed inside the data warehouse system for use with business intelligence (BI) tools. This approach is characteristic of data warehouses in the big data era.
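        A minimal HiveQL sketch of the ELT style, with hypothetical names: raw JSON lines are first loaded into the warehouse untouched, and parsing and cleaning happen later inside Hive itself:

```sql
-- Hypothetical ELT-style sketch: raw data is loaded into the warehouse
-- first (no staging database), and the transformation happens later
-- inside Hive itself.

-- 1. Load: raw files land directly in a warehouse table, untouched.
CREATE EXTERNAL TABLE IF NOT EXISTS raw.events (line STRING)
LOCATION '/data/raw/events/';

-- 2. Transform: parsing and cleaning are done in the warehouse when needed.
INSERT OVERWRITE TABLE dw.events_clean PARTITION (dt = '2024-01-01')
SELECT get_json_object(line, '$.user_id')    AS user_id,
       get_json_object(line, '$.event_type') AS event_type,
       get_json_object(line, '$.ts')         AS event_time
FROM   raw.events
WHERE  get_json_object(line, '$.dt') = '2024-01-01';
```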

6. Scenario analysis: Meituan-Dianping's hotel and travel data warehouse construction practice 

6.1  Architecture changes 

        Within Meituan-Dianping's hotel and travel business group, the business has shifted from traditional group purchasing to richer product forms such as reservations and direct connections, and the business systems have been iterating rapidly. These changes place higher demands on the scalability, stability, and usability of the data warehouse. Based on this, Meituan has adopted a layered, subject-based approach and continuously optimized and adjusted the layered structure. The figure below shows how the technical architecture has changed.

        In the first-generation data warehouse model, because Meituan's overall business system at the time supported a relatively single product form (group purchase) and contained data for all business categories, it was appropriate for the platform team to build a unified data warehouse basic layer that all business lines could use. Therefore, at this stage, the hotel and travel business only built a relatively simple data mart, a so-called small data warehouse.

        The second-generation data warehouse model moved from building a data mart to building a data warehouse directly for the hotel and travel business, making it the sole processor of data from the business's own systems.

        With the integration of Meituan and Dianping, the hotel and travel business's own systems were restructured quite frequently, which greatly affected the stability of the second-generation data warehouse model; the original dimensional model could hardly adapt to such rapid change. The core problem is that the relationship between the business systems in use and the business lines is complex, and the differences between business systems are significant and change frequently.

        Therefore, a data integration layer was added between the ODS layer and the multi-dimensional detail layer, and construction of the data warehouse basic layer shifted from a business-driven to a technology-driven approach. The most fundamental reason for introducing this basic layer is the diversity of Meituan's supply chains, businesses, and data. If the business and data were relatively single and simple, this layered architecture would probably no longer be applicable.

6.2  Theme construction 

        In fact, some traditional industries such as banking, manufacturing, telecommunications, and retail have relatively mature models, such as the familiar BDWM (bank data) model. These were all distilled, optimized, and generalized from the industry experience that companies in those fields accumulated over thirty years of data warehouse construction.

        However, the O2O industry in which Meituan operates has no mature data warehouse subjects or models to draw on. Therefore, after two years of exploration and construction, Meituan has summarized the following seven subjects that suit its current situation (new ones may be added in the future).

6.3  Overall architecture 

        After the technical and business subjects are determined, the overall structure of the data warehouse becomes relatively clear. The seven subjects of the Meituan hotel and travel data warehouse are basically all built with a six-layer structure. Subjects are divided mainly from a business perspective, while layers are divided from a technical perspective, so the overall data warehouse architecture is essentially a combination of business and technology.

        Take the order subject as an example. In building the order subject, Meituan follows a point-to-whole approach: it first builds order-related entities per supply chain (the data integration middle layer), then abstracts and merges the supply-chain-specific order entities into a single order entity (the data integration layer), and finally expands dimension information on top of that order entity to complete the construction of the higher layers.


Origin blog.csdn.net/weixin_46560589/article/details/132984675