Data warehouse modeling and ETL technology

Author: Zen and the Art of Computer Programming

1. Introduction

A data warehouse (Data Warehouse) is an integrated, subject-oriented, enterprise-level information repository: a decision support system used to support management analysis, reporting, and decision-making. It has the following characteristics:

  1. Data integration: A data warehouse is a centralized repository in which information from different data sources is consolidated into a single database or data set. This consolidation facilitates data analysis, report production, and decision support.

  2. Subject modeling: A data warehouse analyzes and models data around business subjects, and data belonging to different subjects is kept independent. For example, sales data can live in one table while product prices live in another. This separation helps improve data quality and query efficiency.

  3. Data redundancy: Data in a data warehouse is either copied from an original data source or derived from it, so the accuracy, completeness, and consistency of the source data must be maintained. Controlled redundancy is an important means of preserving data reliability and integrity.

  4. Abstraction levels: A data warehouse is built on three abstraction levels: facts (Fact), dimensions (Dimension), and cubes (Cube). A fact table records the information generated by business activities; a dimension table records business entities and their related attributes; a cube combines a fact table with multiple related dimension tables to form a multidimensional data set (see the sketch after this list).

  5. Unified view: The data warehouse presents users with a unified, hierarchically structured view and offers query services to the outside world, so users only need to focus on the business data relevant to them. Data warehouse designers need to design appropriate views for different purposes.
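
To make these three abstraction levels concrete, here is a minimal sketch in HiveQL (the SQL dialect used later in this column). The tables and columns are hypothetical; WITH CUBE aggregates the fact over every combination of the listed dimensions:

-- Fact table: one row per business event (here, a sale)
CREATE TABLE f_sales (
    product_id INT,
    region_id  INT,
    amount     DECIMAL(10,2)
);

-- Dimension table: a business entity and its descriptive attributes
CREATE TABLE d_region (
    region_id   INT,
    region_name STRING
);

-- Cube: aggregate the fact over every combination of the chosen dimensions
SELECT d.region_name, f.product_id, SUM(f.amount) AS total_amount
FROM   f_sales f
JOIN   d_region d ON f.region_id = d.region_id
GROUP BY d.region_name, f.product_id WITH CUBE;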

Overall, a data warehouse holds high data value and mining potential. However, building a data warehouse and its derivative systems demands technical expertise, tools, processing power, time, effort, and resources, and building them successfully also requires continuously improving the team's capabilities. Throughout this process, the quality, correctness, completeness, and timeliness of the data are essential. This column, "Data Warehouse Modeling and ETL Technology", systematically introduces data warehouse models and ETL technology, and shares the experience and insights the author has accumulated in practice.

2. Explanation of basic concepts and terms

2.1 Data warehouse concept

A data warehouse (Data Warehouse) is a subject-oriented, integrated, centralized data repository used to support complex analytical work. Its main characteristics are:

  1. Integration: A data warehouse draws on many data sources, including transactions, historical records, master data, dimensions, statistical data, and so on.

  2. Subject-oriented: Data in the warehouse is classified by subject, and each subject is stored in separate tables, making queries simpler and faster.

  3. Centralization: The data warehouse resides on a central server, and all source data enters it through extraction, transformation, and loading (ETL), which ensures the consistency, integrity, and timeliness of the data.

  4. Shareability: Data in the warehouse can be used directly or exported by other departments, reducing costs, shortening development cycles, and accelerating innovation.

2.2 ETL technology

"Extract-Transform-Load" (ETL) is the most basic and important step in building a data warehouse. ETL refers to the process of extracting data from various sources such as databases, files, applications, etc., cleaning, converting, validating, summarizing, and importing it into the data warehouse. It is the cornerstone of the data warehouse and provides users with a unified and intuitive view. The main steps of ETL are shown in the figure below:

  1. Extract: Obtain data from sources, including databases, files, etc.

  2. Transform: Clean, transform, verify, standardize data, etc.

  3. Load: Import data into the data warehouse.
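
As a minimal HiveQL sketch of these three steps (all names and the HDFS path are hypothetical, and the target table dw_orders is assumed to already exist with a matching schema):

-- Extract: expose raw source files as an external staging table
CREATE EXTERNAL TABLE stg_orders (
    order_id   INT,
    order_date STRING,
    amount     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/orders';

-- Transform + Load: clean, cast, and validate, then import into the warehouse
INSERT OVERWRITE TABLE dw_orders
SELECT order_id,
       CAST(order_date AS DATE),
       CAST(amount AS DECIMAL(10,2))
FROM   stg_orders
WHERE  order_id IS NOT NULL                    -- validation: key must be present
  AND  CAST(amount AS DECIMAL(10,2)) >= 0;     -- validation: non-numeric amounts cast to NULL and are dropped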

ETL is crucial to data quality: errors or anomalies in the ETL process can leave the data in the warehouse inaccurate or distorted. ETL engineers should therefore take responsibility for their pipelines and ensure the quality and accuracy of the ETL process.

3. Explanation of core algorithm principles, specific operating steps and mathematical formulas

3.1 Overview of ETL process

The construction of a data warehouse follows the ETL process, which consists of four stages:

  1. Selection stage: determine the data sources to draw from and the destinations to load into. Candidate sources include the internal data of each company and organization; choosing the destination requires designing an appropriate model according to the requirements.

  2. Data extraction stage: extract the data and retain only valid records. Data sources usually contain duplicate data items, so deduplication needs to be performed as part of extraction (see the sketch after this list).

  3. Data conversion stage: this stage covers data cleaning, conversion, validation, and standardization. It is one of the most complex and cumbersome parts of ETL because it embodies a great deal of business logic and conditional judgment.

  4. Data loading stage: load the data into the data warehouse for subsequent analysis, reporting, decision support, and so on.
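
For the deduplication mentioned in stage 2, a common approach in HiveQL is to keep exactly one row per business key using a window function; this sketch assumes hypothetical staging tables ods_orders_raw and ods_orders_dedup:

-- Keep the most recently loaded row per order_id
INSERT OVERWRITE TABLE ods_orders_dedup
SELECT order_id, customer_id, amount
FROM (
    SELECT order_id, customer_id, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_time DESC) AS rn
    FROM ods_orders_raw
) t
WHERE t.rn = 1;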

3.2 Data model overview

The data model is the form in which data is organized in the data warehouse; it specifies where each piece of data sits in the warehouse and how the pieces relate to one another. Common data models include the star schema, the snowflake schema, the dimensional model, and the subject model. Their definitions follow:

(1) Star Schema

The star schema is a relational data model consisting of one fact table and multiple dimension tables. The fact table records the original measurement values in the data set, while each dimension table records the attribute values that describe the facts; the fact table references its dimension tables through foreign keys, which connects facts and dimensions.
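
A typical query against a star schema joins the central fact table to its dimension tables by foreign key and aggregates the measures; the table names here are hypothetical:

SELECT d.year,
       p.category,
       SUM(f.amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim d    ON f.date_id    = d.date_id
JOIN   product_dim p ON f.product_id = p.product_id
GROUP BY d.year, p.category;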

(2) Snowflake Schema

The snowflake schema is a multidimensional data model containing three or more tables, each of which is a fact table or a dimension table. In addition to its own data, each table can record the primary keys of other tables, which makes the associations between tables explicit; dimension tables are thus further normalized into related sub-tables.
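
A small sketch of the snowflake idea with hypothetical tables: the product dimension is normalized and carries the primary key of a separate category table, so resolving it takes one extra join:

CREATE TABLE dim_category_sf (
    category_id   INT,
    category_name STRING
);

CREATE TABLE dim_product_sf (
    product_id   INT,
    product_name STRING,
    category_id  INT   -- primary key of dim_category_sf: the "snowflaked" link
);

-- Resolving the normalized dimension requires the extra join
SELECT p.product_name, c.category_name
FROM   dim_product_sf p
JOIN   dim_category_sf c ON p.category_id = c.category_id;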

(3) Dimensional Modeling

The dimensional model is a technique based on two-dimensional tables that divides data into two categories: facts and dimensions. The fact table records measurements at the intersection of dimensions, while each dimension table records the descriptive attributes of a business entity. In a dimensional model there is no explicit connection stored between the two kinds of tables; they are related implicitly through shared keys. Dimensional models balance complexity, performance, and ease of maintenance, and are a common choice in many data warehouses.

(4) Subject Modeling

The subject model is a more complex data model that organizes data from a subject perspective. It divides data according to business subjects into multiple subject tables; each subject table records the facts of a single subject along with its corresponding dimension data, which enables both analysis within a subject and cross-analysis between subjects. Subject models are flexible and offer powerful analysis capabilities, making them suitable for complex and diverse business environments.
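
A sketch of subject-oriented organization with hypothetical tables: each subject has its own table, and shared keys make cross-subject analysis possible:

-- One table per subject
CREATE TABLE subject_sales     (date_id INT, product_id INT, sales_qty INT);
CREATE TABLE subject_inventory (date_id INT, product_id INT, stock_qty INT);

-- Cross-subject analysis through the shared date and product keys
SELECT s.date_id, s.product_id, s.sales_qty, i.stock_qty
FROM   subject_sales s
JOIN   subject_inventory i
  ON   s.date_id = i.date_id AND s.product_id = i.product_id;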

3.3 Introduction to Hive

Hive is an open-source distributed data warehouse system built on top of Apache Hadoop. It provides HQL (HiveQL, the Hive Query Language) for storing, processing, and analyzing structured data. Hive's three basic components are listed below, followed by a small illustration:

  1. HDFS: used to store massive amounts of data.

  2. MapReduce: used for distributed computing of massive data.

  3. Hive Metastore: used to store metadata information.
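
As a small illustration of how these components cooperate: the query below is resolved against table metadata in the Metastore, compiled into MapReduce jobs, and executed over files stored in HDFS (fact_table is defined in Section 3.4):

-- Hive plans this aggregate as MapReduce jobs over HDFS files
SELECT product_name, SUM(sales_amount) AS total_sales
FROM   fact_table
GROUP BY product_name;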

3.4 Example Hive table creation statements

Create the fact table. It is partitioned by day, since the statements below add dt partitions, and an order_status column is included because the population query joins on it:

CREATE TABLE fact_table (
    customer_id  INT,
    order_date   DATE,
    sales_amount DECIMAL(10,2),
    product_name STRING,
    order_status STRING
)
PARTITIONED BY (dt STRING);   -- daily partition key, e.g. '20240101'

Create the dimension tables. Hive requires a length for VARCHAR, so STRING is used instead; dim_order_status carries a descriptive label, and dim_product includes product_name because the population query joins on it:

CREATE TABLE dim_order_status (
    status      STRING,
    status_desc STRING
);

CREATE TABLE dim_product (
    product_id       INT,
    product_name     STRING,
    product_category STRING,
    manufacturer     STRING
);

Register a daily partition on the fact table and populate a wide table that joins the fact table with its dimension tables ('yyyyMMdd' is a placeholder for a concrete date). Hive cannot safely read from and INSERT OVERWRITE the same table in one statement, so the joined result is written to a separate wide table via CREATE TABLE ... AS SELECT; the manufacturer attribute is taken directly from dim_product:

ALTER TABLE fact_table ADD IF NOT EXISTS PARTITION (dt = 'yyyyMMdd');

CREATE TABLE fact_table_wide AS
SELECT f.customer_id,
       f.order_date,
       f.sales_amount,
       f.product_name,
       ods.status_desc AS order_status,
       p.product_category,
       p.manufacturer
FROM   fact_table f
JOIN   dim_order_status ods ON f.order_status = ods.status
JOIN   dim_product p        ON f.product_name = p.product_name
-- keep only orders whose amount exceeds 1.5x the partition's average
CROSS JOIN (SELECT AVG(sales_amount) * 1.5 AS threshold
            FROM fact_table WHERE dt = 'yyyyMMdd') t
WHERE  f.dt = 'yyyyMMdd'
  AND  f.sales_amount > t.threshold;
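
The wide table can then be queried directly for analysis, for example:

SELECT product_category,
       COUNT(*)          AS high_value_orders,
       SUM(sales_amount) AS total_sales
FROM   fact_table_wide
GROUP BY product_category;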
