What is the basic structure of the data warehouse?

The purpose of a data warehouse is to build an analysis-oriented, integrated data environment that provides decision support for the enterprise. The data warehouse itself does not "produce" any data, nor does it "consume" any: the data comes from external sources and is opened up to external applications, which is why it is called a "warehouse" rather than a "factory". The basic structure of the data warehouse is therefore mainly about how data flows in and out, and it can be divided into three layers: source data, data warehouse, and data application.
[Figure: the three layers of the data warehouse architecture (source data, data warehouse, data application)]
As the figure shows, the data in the data warehouse comes from different sources and serves a variety of data applications. Data flows into the data warehouse from top to bottom and is then opened up to the applications in the upper layer. The data warehouse itself is just the integrated data management platform in the middle.

The process by which the data warehouse obtains data from the various sources, and by which data is converted and moved within the warehouse, can be regarded as ETL (Extract, Transform, Load). ETL is the pipeline of the data warehouse and can also be seen as its lifeblood: it keeps the data in the warehouse circulating, and most of the effort in day-to-day management and maintenance goes into keeping ETL running normally and stably.
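To make the three ETL steps concrete, here is a minimal sketch using only the Python standard library. The CSV content, the `orders` table, and the function names are all made up for illustration; a real pipeline would read from actual source systems and run on a scheduler.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (names and values are made up).
RAW_CSV = """user_id,province,amount
1, Zhejiang ,10.5
2,Beijing,
3,zhejiang,4.5
"""

def extract(text):
    """Extract: read rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize values and drop rows that cannot be analyzed."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        province = row["province"].strip().title()
        cleaned.append((int(row["user_id"]), province, float(amount)))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(user_id INTEGER, province TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database standing in for the warehouse
load(conn, transform(extract(RAW_CSV)))
totals = conn.execute(
    "SELECT province, SUM(amount) FROM orders GROUP BY province"
).fetchall()
print(totals)
```

Keeping the three steps as separate functions makes it easier to see which stage failed when a daily run breaks.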

The following is a brief introduction to the various modules in the data warehouse architecture. Of course, the data warehouse introduced here mainly refers to the website data warehouse.

Data sources of the data warehouse
For a website data warehouse, the clickstream log is a main data source and the basic data for website analysis. The website's database is equally indispensable: it records the website's operational data and the results of user actions, which makes it more accurate for analyzing data such as website outcomes. Other sources include documents generated inside and outside the website and any other data useful for company decision-making.

Data storage of the data warehouse
Source data is exported by ETL's daily scheduled tasks and, after conversion, stored in the data warehouse in a specific form. Whether the data warehouse needs to store detailed data has always been controversial. One view is that the data warehouse is analysis-oriented, so it only needs to store multidimensional analysis models built for specific needs. The other view is that the data warehouse must first establish and maintain detailed data, and then aggregate and process that detail into specific analysis models as requirements arise. I prefer the latter: the data warehouse does not need to store all of the original data, but it does need to store detailed data, and imported data must be sorted and transformed to make it subject-oriented. Briefly:

(1). Why not keep all of the original data? The data warehouse is oriented to analysis and processing, but some source data has no analytical value, or its potential value is far lower than the implementation and performance cost of storing it. For example, knowing the user's province and city is sufficient; the exact address may matter only to the logistics provider. Likewise, the content of users' blog comments may be needed only for text mining, and storing these lengthy texts in the data warehouse does not pay off;

(2). Why keep detailed data? Detailed data is necessary because the analysis requirements placed on the data warehouse change over time. With detailed data we can cope with whatever changes come; if we store only the data models built for particular requirements, we will be at a loss when those requirements change;

(3). Why subject-oriented? Subject orientation is the first characteristic of a data warehouse; it mainly means organizing data sensibly so that it can be analyzed. Source data is organized in many different ways: the clickstream data format is not optimized, and front-end database data is organized for OLTP operations. Neither may be suitable for analysis, whereas a subject-oriented organization genuinely helps it. For example, organizing the clickstream log into three subjects, page (Page), visit (Visit or Session), and user (Visitor), can significantly improve the efficiency of analysis.
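As a rough illustration of point (3), the sketch below reorganizes raw clickstream hits into the three subjects mentioned: page, visit, and visitor. The hit format, the field names, and the 30-minute inactivity cutoff for splitting visits are assumptions made for this example, not something prescribed by the article.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed inactivity cutoff between visits

# Hypothetical raw hits: (visitor_id, unix_timestamp, url)
hits = [
    ("v1", 0, "/home"),
    ("v1", 60, "/product"),
    ("v1", 4000, "/home"),
    ("v2", 100, "/home"),
]

def organize(hits):
    pages = defaultdict(int)     # page subject: url -> pageviews
    visits = []                  # visit subject: one record per visit
    visitors = defaultdict(int)  # visitor subject: visitor_id -> number of visits
    last_seen = {}               # visitor_id -> (last timestamp, index of current visit)
    for visitor_id, ts, url in sorted(hits, key=lambda h: (h[0], h[1])):
        pages[url] += 1
        prev = last_seen.get(visitor_id)
        if prev is None or ts - prev[0] > SESSION_TIMEOUT:
            # First hit for this visitor, or too long since the last hit: new visit.
            visits.append(
                {"visitor_id": visitor_id, "start": ts, "end": ts, "pageviews": 0}
            )
            visitors[visitor_id] += 1
            index = len(visits) - 1
        else:
            index = prev[1]
        visits[index]["end"] = ts
        visits[index]["pageviews"] += 1
        last_seen[visitor_id] = (ts, index)
    return dict(pages), visits, dict(visitors)

pages, visits, visitors = organize(hits)
print(pages)
print(visits)
print(visitors)
```

Once the log is in this shape, questions about pages, visits, or visitors can each be answered from one structure instead of rescanning the raw hits.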

The data warehouse processes the data on the basis of maintaining detailed data, so that it can be truly applied to analysis. It mainly includes three aspects:

Data aggregation

The aggregated data here refers to simple aggregation for specific needs (aggregation along multiple dimensions belongs to the multidimensional data model). It can be totals such as the website's Pageviews, Visits, and Unique Visitors, or averages such as Avg. time on page and Avg. time on site. These figures can be displayed directly in reports.
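A small sketch of this kind of simple aggregation, computed from visit-level detail rows. The row layout and field names are hypothetical; only totals and one average are shown.

```python
from statistics import mean

# Hypothetical visit-level detail rows (one per visit).
visits = [
    {"visitor_id": "v1", "pageviews": 3, "duration_s": 120},
    {"visitor_id": "v1", "pageviews": 1, "duration_s": 0},
    {"visitor_id": "v2", "pageviews": 2, "duration_s": 60},
]

summary = {
    "pageviews": sum(v["pageviews"] for v in visits),
    "visits": len(visits),
    "unique_visitors": len({v["visitor_id"] for v in visits}),
    "avg_time_on_site_s": mean(v["duration_s"] for v in visits),
}
print(summary)
```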

Multidimensional data model

The multidimensional data model supports multi-angle, multi-level analysis. For example, a sales star schema or snowflake schema built on a time dimension and a geographic dimension allows cross-queries across the two dimensions as well as breakdowns (drill-down) along each of them. Applications of multidimensional data models are therefore generally based on online analytical processing (Online Analytical Processing, OLAP), and data marts for specific groups of users are also built on multidimensional data models.
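The sales example can be sketched as a tiny star schema in SQLite: one fact table joined to a time dimension and a geographic dimension. The table names, column names, and figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, province TEXT, city TEXT);
CREATE TABLE fact_sales (date_id INTEGER, region_id INTEGER, amount REAL);
INSERT INTO dim_date VALUES (1, 2020, 1), (2, 2020, 2);
INSERT INTO dim_region VALUES
    (1, 'Zhejiang', 'Hangzhou'), (2, 'Zhejiang', 'Ningbo'), (3, 'Jiangsu', 'Nanjing');
INSERT INTO fact_sales VALUES (1, 1, 100), (1, 2, 50), (2, 1, 70), (2, 3, 30);
""")

# Cross-query: sales by month (time dimension) and province (geographic dimension).
by_month_province = conn.execute("""
SELECT d.year, d.month, r.province, SUM(f.amount)
FROM fact_sales f
JOIN dim_date d ON d.date_id = f.date_id
JOIN dim_region r ON r.region_id = f.region_id
GROUP BY d.year, d.month, r.province
ORDER BY d.year, d.month, r.province
""").fetchall()

# Drill-down: break one province down to the city level.
zhejiang_by_city = conn.execute("""
SELECT r.city, SUM(f.amount)
FROM fact_sales f
JOIN dim_region r ON r.region_id = f.region_id
WHERE r.province = 'Zhejiang'
GROUP BY r.city
ORDER BY r.city
""").fetchall()

print(by_month_province)
print(zhejiang_by_city)
```

A snowflake schema would differ only in that the dimensions themselves are normalized further, for example by splitting province out of `dim_region` into its own table.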

Business model

The business model here refers to data models built for data analysis and decision support, such as the user evaluation model, correlation recommendation model, and RFM analysis model I introduced before, or decision-support models such as linear programming and inventory models. The early-stage data processing for data mining can also be completed here.
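As one example of a business model built on detailed data, here is a minimal sketch of the raw inputs to an RFM analysis: Recency (days since the last order), Frequency (number of orders), and Monetary (total amount) per customer. The order data, the analysis date, and the function name are hypothetical, and the scoring or segmentation step that usually follows is left out.

```python
from datetime import date

# Hypothetical order detail: (customer_id, order_date, amount)
orders = [
    ("c1", date(2020, 9, 1), 100.0),
    ("c1", date(2020, 9, 20), 50.0),
    ("c2", date(2020, 6, 15), 300.0),
]
as_of = date(2020, 9, 30)  # assumed analysis date

def rfm(orders, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    result = {}
    for customer_id, order_date, amount in orders:
        days_ago = (as_of - order_date).days
        recency, frequency, monetary = result.get(customer_id, (days_ago, 0, 0.0))
        result[customer_id] = (min(recency, days_ago), frequency + 1, monetary + amount)
    return result

scores = rfm(orders, as_of)
print(scores)
```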

Data applications of the data warehouse
Report display

Reports are an indispensable type of data application in almost every data warehouse. Aggregated data and multi-dimensional analysis data are displayed in reports, providing the simplest and most intuitive data.

Ad hoc query

In theory, all data in the data warehouse (detailed data, aggregated data, multidimensional data, and analysis data) should be open to ad hoc query. Ad hoc query is a sufficiently flexible way to obtain data: users can query and retrieve the data they need and export it to external files such as Excel.

Data analysis

Most data analysis can be based on the business models already built. Aggregated data can also be used for trend analysis, comparative analysis, correlation analysis, and so on, and the multidimensional data model provides the data basis for multidimensional analysis. Drawing samples from the detailed data for a specific analysis is also common.

Data mining

Data mining uses more advanced algorithms to surface a variety of surprising results from the data. It can be based on the business models already built in the data warehouse, but most of the time it starts directly from the detailed data, with the data warehouse providing data interfaces for mining tools such as SAS and SPSS.

Metadata management
Metadata (Meta Data) is really explanatory data, that is, data that describes data. It mainly records the model definitions in the data warehouse, the mappings between layers, the monitored data status of the data warehouse, and the running status of ETL tasks. A metadata repository (Metadata Repository) is generally used to store and manage metadata uniformly; its main purpose is to keep the design, deployment, operation, and management of the data warehouse coordinated and consistent.
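To give a feel for what a metadata repository might hold, here is a toy in-memory sketch that records table definitions, the mapping between layers, and ETL task status. The class names, fields, and table names are all hypothetical; real repositories are usually database-backed and far richer.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    name: str
    description: str
    columns: dict                                      # column name -> description
    source_tables: list = field(default_factory=list)  # upstream tables it is derived from

@dataclass
class MetadataRepository:
    tables: dict = field(default_factory=dict)    # table name -> TableMetadata
    etl_runs: list = field(default_factory=list)  # log of ETL task statuses

    def register(self, table):
        self.tables[table.name] = table

    def record_run(self, task, status):
        self.etl_runs.append({"task": task, "status": status})

    def lineage(self, name):
        """Return all upstream tables of `name`, nearest first."""
        seen = []
        queue = list(self.tables[name].source_tables)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self.tables:
                queue.extend(self.tables[current].source_tables)
        return seen

repo = MetadataRepository()
repo.register(TableMetadata(
    "visit_detail", "One row per visit",
    {"visitor_id": "Visitor identifier"}, ["clickstream_log"],
))
repo.register(TableMetadata(
    "daily_summary", "Daily site totals",
    {"pageviews": "Total pageviews"}, ["visit_detail"],
))
repo.record_run("load_visit_detail", "success")
upstream = repo.lineage("daily_summary")
print(upstream)
```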

To conclude: the data warehouse itself neither produces nor consumes data; it serves as an intermediate platform that stores data in an integrated manner. The difficulty of implementing a data warehouse lies in building the overall architecture and designing ETL, which is also the bulk of daily management and maintenance. The true value of the data warehouse lies in the data applications built on it; without effective data applications, building a data warehouse loses its meaning.

Origin blog.csdn.net/JACK_SUJAVA/article/details/108854088