Original: Data Warehouse Architecture and Construction Method

1. Data warehouse overview

1.1. The origin of data warehouse

     Before a data warehouse is built, data is scattered across the data stores of the applications used by the various departments of the enterprise, with complex business connections among them. Taken as a whole, it looks like a huge spider web: complex in structure and extending in all directions. Within a single business, such data is easy to use and flexible; but once an application spans businesses and departments, several problems appear: ① data sources are diverse, and the data needed for management decisions is too scattered; ② data lacks standards and is difficult to integrate; ③ data definitions (calibers) are inconsistent, so reliability is low; ④ data quality is hard to guarantee because there is no data management and control system. As shown below:


     If an enterprise has no overall plan for data construction and simply lets its data evolve naturally, its future data applications will face the following problems:

  • Lack of data credibility: no unified dimensions; differences in data algorithms; multiple levels of extraction; problems with external data; no common initial data source.
  • Low productivity: enterprise reports must be generated from all the data; locating data requires browsing a large number of files; there are many extraction programs, each of them customized, so many technical obstacles must be overcome.
  • Infeasibility of turning data into information: data is not integrated, and the historical data needed to turn data into information is missing.

   Based on the above problems, it is necessary to establish an enterprise-level data warehouse.

1.2. Data warehouse development

    The embryonic stage of the data warehouse: MIT (Massachusetts Institute of Technology) carried out extensive research in the 1970s. After a series of tests and demonstrations, it proposed separating the business system from the analysis system and handling business processing and analytical processing at different levels. Its conclusion was that analysis systems and business systems must be handled separately, using completely different architectures and design methods.

    The exploratory stage, covering the principles, architecture, and specifications of the data warehouse: In 1988, IBM proposed the "Information Warehouse", whose goal was to solve the problem of enterprise data integration. The design aimed at "a structured environment that can support end users to manage all their business and support the IT department in ensuring data quality." However, IBM used this advanced concept only for marketing and did not carry it through to an actual architectural design.

    The formal proposal of the data warehouse: In 1991, Bill Inmon published the first book on data warehousing, "Building the Data Warehouse", which introduced the concept of the data warehouse, explained why one should be built, and described a way to build it.

1.3. Data warehouse definition

     A data warehouse is a subject-oriented, integrated, relatively stable (non-volatile) collection of data that reflects historical changes (that is, it varies over time). Its main purpose is to support the decision analysis of enterprise managers. The data warehouse collects historical data from the various internal and external business systems related to the enterprise, including their data sources and archived files, and ultimately converts it into the strategic decision-making information the enterprise needs.

1.3.1. Features of Data Warehouse

  • Subject-oriented: Ordinary operational databases are mainly oriented to transaction processing, whereas the data in a data warehouse is generally organized by subject. A subject is an abstraction of business data: it summarizes and organizes the data in the information systems at a higher level. Organizing data by subject involves two tasks: extracting the subjects according to the characteristics of the source systems' business data, and determining the data content each subject contains. Examples are the customer, product, and finance subjects; the customer subject in turn includes basic customer information, customer credit information, and customer resource information. When analyzing data warehouse subjects, the usual approach is to first determine a few basic subjects, then expand the scope, and finally refine them step by step.
  • Integration: Operational databases are usually heterogeneous and independent of one another, so they cannot summarize information or reflect its essence. The data in the data warehouse is obtained through extraction, cleaning, transformation, and loading. To ensure that the data is unambiguous, it must be uniformly encoded and summarized so that the data in the warehouse is consistent. After the integration stage, the data in the warehouse follows unified coding rules, and much redundant data has been eliminated.
  • Stability: The data in the data warehouse reflects the content of a historical period, and its main operations are query and analysis, generally without updates (by contrast, the operational databases that feed it mainly perform inserts, updates, deletes, and queries). Once data enters the data warehouse, it is generally retained for a long time and is deleted only when it exceeds the specified retention period. The usual work of a data warehouse is therefore loading, querying, and analysis, generally without modification operations, in support of decision analysis by the enterprise's senior staff.
  • Reflecting historical changes: The data warehouse continuously obtains changed data from operational databases or other data sources, accumulating the historical data needed for analysis and forecasting. For this reason, the keys (dimensions) of data warehouse tables generally include a time key that indicates the historical period the data belongs to, and new data is continuously appended. This historical information makes it possible to analyze the enterprise's development history and predict its trends. Building a data warehouse requires accumulating a large amount of business data; this valuable historical information is processed, organized, and finally provided to decision analysts. That is the fundamental purpose of building a data warehouse.
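
The integration, stability, and time-key features listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the source system names (`crm`, `billing`), the gender code mappings, and the `integrate` helper are all hypothetical and are not taken from any particular product.

```python
from datetime import date

# Hypothetical example: two source systems encode gender differently.
# Integration maps both onto one warehouse-wide code set.
GENDER_CODES = {
    "crm": {"M": "male", "F": "female"},
    "billing": {"1": "male", "2": "female"},
}


def integrate(source, rows, load_date):
    """Translate source-specific codes and stamp each row with a time key."""
    mapping = GENDER_CODES[source]
    return [
        {
            "customer_id": row["customer_id"],
            "gender": mapping[row["gender"]],      # unified coding rule
            "source_system": source,
            "load_date": load_date.isoformat(),    # time key for history
        }
        for row in rows
    ]


# The warehouse is treated as append-only: rows are loaded and queried,
# not updated in place.
warehouse = []
warehouse.extend(
    integrate("crm", [{"customer_id": 1, "gender": "F"}], date(2024, 1, 1))
)
warehouse.extend(
    integrate("billing", [{"customer_id": 2, "gender": "1"}], date(2024, 1, 2))
)

# After integration, only the unified codes remain.
genders = sorted({row["gender"] for row in warehouse})
print(genders)
```

Because every row carries a `load_date`, later loads add new rows for new periods instead of overwriting old ones, which is what allows historical analysis.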

1.3.2. Advantages of data warehouse

  • Simplified information flow after data integration
  • Increased utilization of shared data
  • Centralized data management, single source
  • Form a single view of business and standardize data
  • Data management and control system ensures data quality


    In building a data warehouse, we repeatedly emphasize the need for a data model. Why is the data model so important? To answer this, we first need to understand how data warehouse construction has developed. The development of the data warehouse has gone through roughly three stages.

    Looking at these stages, we can see that the important difference between building a data warehouse and building a data mart lies in the support of a data model. The data model is therefore of decisive significance for building a data warehouse, and building it helps solve a number of problems.

3.3.1. Business modeling

    By definition, the business model is the highest-level data model. It mainly establishes the main subjects of the data warehouse's subject domain model and the important business relationships. Generally speaking, before a data warehouse system is designed and developed, the designers, developers, and business personnel reach a consensus on the division of subject domains through preliminary business modeling, because the subject domain model reflects the core business issues.

    The process of subject domain modeling runs roughly as follows. In the preceding business modeling stage, the data of the business systems has already been sorted out. According to the characteristics of each business, a detailed list of data subjects is drawn up, and each data subject is explained in detail. Then, through induction and classification, the subjects are organized into data subject areas, and the components of each subject area are listed. Each data subject area is explained in detail, and the result is finally divided into the subject area conceptual model.

3.3.3. Logical Modeling

    By definition, a logical model is based on a conceptual model and further refines and decomposes it. The logical model describes the business requirements and the technical side of system implementation through entities and the relationships between them, and it serves as a bridge and platform for communication between business and technical personnel. Designing the logical model is the most important step in implementing a data warehouse, because it directly reflects the actual needs and business rules of the business departments and at the same time guides the design and implementation of the physical model. Its defining characteristic is that it outlines the data blueprint and rules of the entire enterprise through entities and their relationships. The subject domains of the conceptual model are generally those of the business model, summarized from the enterprise's existing information systems and the business activities of the industry itself. Beyond enriching and refining the subject areas of the conceptual model and determining which subjects each subject area contains, the logical model carries further requirements of its own.

    On the basis of the logical model, the process of selecting a suitable physical structure for the production environment of the application, including suitable storage structures and storage methods, is called the design of the physical model. Transforming a logical model into a physical model proceeds in several steps.

    The abstraction and induction method we use is actually very simple: any business can be regarded as consisting of three parts. Because the entity modeling method makes it easy to divide up the business model, it is widely used in the business modeling and domain modeling stages. Generally, when no ready-made industry model is available, entity modeling can be used to work through the entire business model with the customer, divide the domain concepts, abstract the specific business concepts, and, taking the customer's usage characteristics into account, create a data warehouse model that meets their needs. However, entity modeling also has an inherent limitation. Because it is only a method of abstracting objective events, it is confined to the business modeling and domain concept modeling stages. In the logical and physical modeling stages, paradigm modeling and dimensional modeling come into their own.

3.4.2. Paradigm modeling method

    According to Inmon, the data warehouse model is built in much the same way as the enterprise data model of a business system. In a business system, the enterprise data model determines the source of data, and it is divided into two levels: the subject domain model and the logical model. Likewise, the subject domain model can be regarded as the conceptual model of the business model, while the logical model is the instantiation of the domain model on a relational database.

    When shifting from the business data model to the data warehouse model, the data warehouse likewise needs a domain model (that is, a conceptual model) and a logical model of that domain model. Here, the data model of the business system differs slightly from the data warehouse model.

    The biggest advantage of the paradigm (normal-form) modeling method is that, working from the perspective of the relational database and drawing on the data model of the business system, it makes data warehouse modeling relatively convenient. Its shortcomings are also obvious. Because the method is tied to the relational database, it sometimes limits the flexibility and performance of the whole data warehouse model; in particular, when the underlying data of the warehouse is aggregated into data marts, certain transformations are required to meet the corresponding requirements.

3.4.3. Dimensional Modeling

    Dimensional modeling was first proposed by Kimball. The simplest description is: build data warehouses and data marts from fact tables and dimension tables. The most widely known name for this approach is star modeling.
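
As a rough illustration of fact tables and dimension tables, the following sketch builds a tiny star schema in an in-memory SQLite database using Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_date`, `dim_product`), the columns, and the sample rows are assumptions made up for this example; they are not taken from the article.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    month    INTEGER
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);

-- Fact table: foreign keys to each dimension plus numeric measures.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);

INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'),
    (2, 'Manual', 'Books');
INSERT INTO fact_sales VALUES
    (20240101, 1, 2, 20.0),
    (20240101, 2, 1, 5.0),
    (20240201, 1, 3, 30.0);
""")

# A typical star-schema query: join the fact table to a dimension
# and aggregate a measure by one of the dimension's attributes.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(sales_by_category)
conn.close()
```

The fact table holds only keys and measures, while the descriptive attributes live in the dimension tables, so an analysis by category needs just one join and one aggregation.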

     The star schema is the most typical architecture of this approach. It is widely used because a great deal of preprocessing is done for each dimension, such as pre-aggregation, classification, and sorting by dimension. This preprocessing can greatly improve the processing capability of the data warehouse; compared with 3NF modeling in particular, the star schema has a clear performance advantage. Another advantage is that dimensional modeling is very intuitive: it revolves closely around the business model, directly reflects the business problems in it, and requires no special abstraction.

     The disadvantages of dimensional modeling are equally clear. Because a lot of data preprocessing is required before the star model can be built, it entails a large amount of data processing work. Moreover, when the business changes and dimensions must be redefined, the dimension data often has to be reprocessed, and this preprocessing often produces a large amount of redundant data. A further disadvantage is that pure dimensional modeling cannot guarantee the consistency and accuracy of data sources, and dimensional modeling is not particularly well suited to the bottom layer of the data warehouse.

4. Dimensional modeling

     Dimensional modeling follows a fixed order of steps: ① business process, ② granularity, ③ dimensions, ④ facts.

4.3. The data warehouse is divided at the dimensional modeling level
