Data warehouse knowledge points

What is a data warehouse?

A data warehouse is a structured data environment that serves as the data source for decision support systems (DSS) and online analytical applications. It addresses the problem of obtaining information from databases.
A data warehouse is subject-oriented, integrated, stable (non-volatile), and time-variant, and it is used to support management decision-making.
The significance of the data warehouse is that it consolidates all of the enterprise's data and provides a unified, standardized data outlet for every department of the enterprise.

  • Subject-oriented: The data in the data warehouse is organized by subject area, and each subject corresponds to a macro-level analysis area. The data warehouse excludes data that is not useful for decision-making and provides a concise view of each subject.
  • Integrated: The data is enterprise-level, and it must remain consistent, complete, valid, and accurate.
  • Stable: Once loaded, the data for a given period remains unchanged. There are essentially no update or delete operations; the data is mainly queried and analyzed.
  • Time-variant: The data warehouse keeps a complete record of how an object changes over a period of time.

The goal of a data warehouse is to achieve an organized and structured collection of stored data that is integrated, stable, and reflects historical changes.
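The "stable" and "time-variant" properties can be illustrated with a minimal sketch. It assumes a hypothetical `customer_history` table in an in-memory SQLite database; the table, column, and function names are illustrative and not from any particular warehouse. A change is recorded by appending a row, never by updating one, so any past state can still be queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical history table: one row per observed state of a customer.
conn.execute(
    "CREATE TABLE customer_history ("
    " customer_id INTEGER, city TEXT, load_date TEXT)"
)


def load_snapshot(customer_id, city, load_date):
    """Record a state by appending a row; existing rows are never updated."""
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        (customer_id, city, load_date),
    )


def city_as_of(customer_id, as_of_date):
    """Return the city that was valid on as_of_date, or None if unknown."""
    row = conn.execute(
        "SELECT city FROM customer_history"
        " WHERE customer_id = ? AND load_date <= ?"
        " ORDER BY load_date DESC LIMIT 1",
        (customer_id, as_of_date),
    ).fetchone()
    return row[0] if row else None


load_snapshot(1, "Beijing", "2020-07-01")
load_snapshot(1, "Shanghai", "2020-08-01")  # the customer moved; old row is kept

print(city_as_of(1, "2020-07-15"))
print(city_as_of(1, "2020-08-15"))
```

Dates are stored as ISO strings so that plain string comparison orders them correctly.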

Log data: Data collected through an SDK (software development kit), implemented for example in JS or Java code. The SDK here is a set of tools we develop to collect data on user interaction with the front end (clicks, page views, likes, ads, error logs). Collection works by listening for events. After collection, the data is encrypted, compressed, and transcoded, then sent to the back-end log server in real time, on a schedule, or depending on network conditions.
Business data: The data recorded in the database, which captures each business process through the transaction mechanism.
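The log collection flow described above (listen for events, buffer them, compress and transcode, then send) might look roughly like the following sketch. `EventBuffer`, `track`, `flush`, and `decode_payload` are hypothetical names, not the API of any real SDK. The sketch flushes by batch size only, skips encryption, and stands in for the log server with a plain Python list.

```python
import base64
import json
import zlib


class EventBuffer:
    """Hypothetical client-side buffer; not a real SDK API."""

    def __init__(self, send, max_events=3):
        self.send = send              # callable that ships a payload to the log server
        self.max_events = max_events  # flush once this many events are buffered
        self.events = []

    def track(self, event_type, **fields):
        """Record one user-interaction event and flush when the batch is full."""
        self.events.append({"type": event_type, **fields})
        if len(self.events) >= self.max_events:
            self.flush()

    def flush(self):
        """Compress (zlib), transcode (base64), and send the buffered events."""
        if not self.events:
            return
        raw = json.dumps(self.events).encode("utf-8")
        payload = base64.b64encode(zlib.compress(raw))
        self.send(payload)
        self.events = []


def decode_payload(payload):
    """Server-side inverse of flush(): base64-decode, decompress, parse JSON."""
    return json.loads(zlib.decompress(base64.b64decode(payload)).decode("utf-8"))


sent = []  # stands in for the back-end log server
buffer = EventBuffer(send=sent.append, max_events=2)
buffer.track("click", button="buy")
buffer.track("browse", page="/home")  # the second event fills the batch and triggers a flush

print(decode_payload(sent[0]))
```

A timer-based or network-aware flush policy would call the same `flush()` method from a different trigger.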

What is a database?

A database is a repository built on computer storage devices that organizes, stores, and manages data according to a data structure.
A database is an organized, shareable collection of data stored in a computer over the long term. The data is organized, described, and stored according to a data model, with as little redundancy as possible, high data independence, and easy extensibility, and it can be shared by multiple users within a certain scope.
A database supports the business, so it needs very fast responses with minimal delay. Its queries are typically record-by-record lookups that fetch all the data related to one record, which is what a relational database is suited to. A data warehouse, by contrast, is mainly used to support analysis.
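The difference between the two access patterns can be shown with a small sketch. It uses an in-memory SQLite database and a hypothetical `orders` table purely for illustration; a real warehouse would run the analytical query on a separate engine over far more data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Beijing", 10.0), (2, "Shanghai", 20.0), (3, "Beijing", 30.0)],
)

# Business (database) style: look up a single record by its key.
one_order = conn.execute(
    "SELECT city, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Analysis (data warehouse) style: scan many records and aggregate them.
totals = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
).fetchall()

print(one_order)
print(totals)
```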

Data warehouse and database comparison

Data Warehouse Modeling Theory

ER (Entity-Relationship) model

  • The ER model is the theoretical basis of database design. Almost all current OLTP systems are designed with ER modeling.
  • The data warehouse theory proposed by Bill Inmon recommends ER modeling.
  • The BI architecture proposes a layered design, and the ODS and DWD layers at the bottom of the data warehouse are mostly designed with ER modeling.
  • ER modeling standard: avoid data redundancy as far as possible.

 Dimensional modeling

Dimensional modeling originated with the data mart and mainly targets analysis scenarios. It is used for data warehouse modeling and as the underlying data model of mainstream OLAP engines.

Star:

Snowflake:

Comparison of snowflake model and star model:

  • Redundancy: The snowflake model follows the business logic design and uses 3NF, which effectively reduces data redundancy. The dimension tables of the star model do not conform to 3NF; they are denormalized, so the dimension tables are not directly related to one another, at the cost of extra storage space.
  • Performance: Because the snowflake model normalizes its dimensions to 3NF to reduce redundancy, queries usually need to join more dimension tables, which lowers performance. The star model denormalizes and merges the dimension levels into fewer, wider dimension tables, which reduces the number of dimension-table joins at the cost of storage space, so it generally performs better than the snowflake model.
  • ETL (extract, transform, load): The snowflake model follows the ER design of the business systems, so the ETL process is relatively simple, but because its dimension tables depend on one another, ETL tasks parallelize poorly. The star model denormalizes its dimension tables, so integrating business data into them during ETL is harder, but because it avoids dependent dimension levels, the loads can run in parallel.
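The redundancy and join-count trade-off in the comparison above can be made concrete with a small sketch. All table and column names (`dim_product_star`, `dim_product_snow`, `dim_category`, `fact_sales`) are hypothetical, and SQLite is used only because it ships with Python. Both queries return the same totals; the snowflake version needs one more join, while the star version repeats the category name on every product row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized product dimension (category name repeated per product)
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: category is split into its own table (3NF)
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));

CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

INSERT INTO dim_product_star VALUES
    (1, 'pen', 'stationery'), (2, 'ink', 'stationery'), (3, 'mug', 'kitchen');
INSERT INTO dim_category VALUES (10, 'stationery'), (20, 'kitchen');
INSERT INTO dim_product_snow VALUES (1, 'pen', 10), (2, 'ink', 10), (3, 'mug', 20);
INSERT INTO fact_sales VALUES (1, 5.0), (2, 7.0), (3, 9.0), (1, 1.0);
""")

# Star model: the fact table joins one dimension table.
STAR_QUERY = """
SELECT p.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_star p ON p.product_id = f.product_id
GROUP BY p.category_name ORDER BY p.category_name
"""

# Snowflake model: the same question needs a second join to reach the category.
SNOWFLAKE_QUERY = """
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_snow p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category_name ORDER BY c.category_name
"""

star_result = conn.execute(STAR_QUERY).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_QUERY).fetchall()
print(star_result == snowflake_result)
```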

Data Vault model

Data Vault is derived from the ER model. The original intent of the design is to organize base data effectively so that it is easy to extend and can respond flexibly to business changes, while emphasizing history, traceability, and atomicity. It does not require extensive consistency processing of the data, and it is not designed for analysis scenarios.

Contains three structures:

  • Satellite table (satellite): historical descriptive data, the real carrier of data in the data warehouse
  • Link table (link): represents the relationships between hubs; the business relationships of the whole enterprise are connected through link tables
  • Central table (hub): a list of unique business keys that uniquely identifies the enterprise's actual business; the set of the enterprise's business entities
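A minimal sketch of the three structures, assuming hypothetical customer and order entities in an in-memory SQLite database. It is deliberately simplified: real Data Vault designs usually also carry hash keys and a record-source column, which are omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: one row per unique business key
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_date TEXT);
CREATE TABLE hub_order (order_key TEXT PRIMARY KEY, load_date TEXT);

-- Link: relationship between two hubs
CREATE TABLE link_customer_order (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    order_key TEXT REFERENCES hub_order(order_key),
    load_date TEXT);

-- Satellite: descriptive attributes of a hub, kept as history
CREATE TABLE sat_customer (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    load_date TEXT, city TEXT);

INSERT INTO hub_customer VALUES ('C001', '2020-07-01');
INSERT INTO hub_order VALUES ('O100', '2020-07-02');
INSERT INTO link_customer_order VALUES ('C001', 'O100', '2020-07-02');
INSERT INTO sat_customer VALUES ('C001', '2020-07-01', 'Beijing');
INSERT INTO sat_customer VALUES ('C001', '2020-08-01', 'Shanghai');
""")

# The satellite holds the full descriptive history of the business key.
history = conn.execute(
    "SELECT load_date, city FROM sat_customer"
    " WHERE customer_key = 'C001' ORDER BY load_date"
).fetchall()

# The link connects the customer hub to the order hub.
orders = conn.execute(
    "SELECT order_key FROM link_customer_order WHERE customer_key = 'C001'"
).fetchall()

print(history)
print(orders)
```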

 

Anchor

  • Anchor normalizes the Data Vault model further. The original intent is to design a highly extensible model. The core idea is that every extension only adds structures and never modifies existing ones, so the resulting model is essentially a key-value (KV) structure and reaches 6NF.
  • Map model
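The "add only, never modify" idea can be sketched as follows, under the assumption that each attribute lives in its own table keyed by the anchor's identifier. The table layout and the `add_attribute` helper are hypothetical simplifications, not the full Anchor Modeling notation. Adding a new attribute creates a new table and leaves every existing table untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Anchor: only the identity of the entity
CREATE TABLE customer_anchor (customer_id INTEGER PRIMARY KEY);
-- One table per attribute (key plus a single value)
CREATE TABLE customer_name (
    customer_id INTEGER REFERENCES customer_anchor(customer_id), name TEXT);
INSERT INTO customer_anchor VALUES (1);
INSERT INTO customer_name VALUES (1, 'Alice');
""")


def add_attribute(entity, attribute):
    """Extend the model by adding a table; no existing table is altered.

    The identifiers are trusted in this sketch and interpolated directly.
    """
    conn.execute(
        f"CREATE TABLE {entity}_{attribute} ("
        f" {entity}_id INTEGER REFERENCES {entity}_anchor({entity}_id),"
        f" {attribute} TEXT)"
    )


add_attribute("customer", "city")  # new requirement: track the customer's city
conn.execute("INSERT INTO customer_city VALUES (1, 'Beijing')")

# Reassemble the entity by joining its attribute tables back to the anchor.
row = conn.execute(
    "SELECT n.name, c.city FROM customer_anchor a"
    " LEFT JOIN customer_name n ON n.customer_id = a.customer_id"
    " LEFT JOIN customer_city c ON c.customer_id = a.customer_id"
    " WHERE a.customer_id = 1"
).fetchone()
print(row)
```

The price of this extensibility is visible in the last query: reading one entity requires one join per attribute.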

 

Modeling summary

The above are the four basic modeling methods. The current mainstream ones are the ER model (mainly used in databases) and the dimensional model (mainly used in data warehouses).

  • The ER model is commonly used for OLTP database modeling. When applied to building a data warehouse, it focuses on data integration: taking an enterprise-wide view, it makes the data of each system consistent and merges it to serve data analysis and decision-making, but it is generally not used directly to support analysis.
  • Dimensional modeling was created for analysis scenarios and builds the data warehouse model around them. It focuses on meeting analysis needs quickly and flexibly while providing fast responses over large-scale data. It is mainly used for data warehouse construction and as the underlying data model of OLAP engines.
  • No need to thoroughly sort out business processes and data up front
  • The implementation cycle is determined by the subject boundary, and it is easy to deliver a demo quickly
  • Favor redundancy: the data warehouse (Hive) sits on top of HDFS, where disk space is rarely the limiting factor. The star model is recommended; the snowflake model can be used, but without too many levels.

 

Origin blog.csdn.net/Poolweet_/article/details/107720906