A simple and clear introduction to the data warehouse

As the name suggests, a data warehouse is a warehouse that stores data. It collects data from an enterprise's various business systems. In the financial industry, for example, a data warehouse holds data from the loan business, CRM, the deposit business, and so on. Enterprises use it for data analysis, reporting, and decision-making; some companies also use it as a data source for other business systems.

Logically speaking, a database and a data warehouse are not fundamentally different: both are places where data is stored by means of database software. A data warehouse, however, usually holds far more data than a business database.

The main difference lies in how they are used. Traditional transactional databases such as MySQL serve online transaction processing (OLTP), that is, recording business transactions as they occur, while data warehouses mainly serve online analytical processing (OLAP), such as producing reports.
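The contrast between the two workloads can be sketched in a few lines. This is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for both kinds of system, and the `account` table and its columns are made up for the example.

```python
import sqlite3


def demo_oltp_vs_olap():
    """Run one OLTP-style update and one OLAP-style aggregate on the same data."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, invented for this sketch.
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, branch TEXT, balance REAL)"
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?, ?)",
        [(1, "north", 100.0), (2, "north", 250.0), (3, "south", 400.0)],
    )

    # OLTP-style work: a short transaction that touches a single row.
    with conn:
        conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")

    # OLAP-style work: a read-only aggregate that scans many rows for a report.
    report = conn.execute(
        "SELECT branch, SUM(balance) FROM account GROUP BY branch ORDER BY branch"
    ).fetchall()
    conn.close()
    return report
```

In a real deployment the update would run against the business database and the aggregate against the warehouse, so heavy reports do not compete with live transactions.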

Some readers may think that data analysis, report generation, and similar tasks can be done directly against the business database, so a data warehouse does not seem necessary.

For a simple system this is indeed feasible, for example a start-up running a few servers and several MySQL instances, with low business volume, few users, and little data. But once the business grows, the number of users and the volume of data become huge, and reports have to correlate data from multiple systems across clusters, a data warehouse becomes necessary.

If this is still not clear, think about a few questions first:

If the data you want is stored in many different databases, or even scattered across various log files, how do you get it?

If you extract the data you want from various data sources but find that the formats or data types differ, how do you standardize it?

If one day you need to check historical data in the business system but find that the data has since been modified, what do you do?

If you want to correlate data from different business systems across clusters, how do you do it, and how do you keep query times acceptable?

A data warehouse addresses these problems well. Through data extraction and cleaning, it integrates data from the various business systems into a single system (the data warehouse) and standardizes it, which makes the data easy to obtain when producing reports and making decisions.
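Once data from several systems sits in one place, correlating it becomes an ordinary join. The sketch below assumes two hypothetical extracts, `LOAN_SYSTEM` and `DEPOSIT_SYSTEM`, and uses an in-memory SQLite database in place of a real warehouse; all table and column names are invented for the example.

```python
import sqlite3

# Hypothetical extracts from two separate business systems.
LOAN_SYSTEM = [
    {"customer_id": 1, "loan_amount": 5000.0},
    {"customer_id": 2, "loan_amount": 12000.0},
]
DEPOSIT_SYSTEM = [
    {"customer_id": 1, "deposit_amount": 800.0},
    {"customer_id": 3, "deposit_amount": 1500.0},
]


def build_warehouse():
    """Load both extracts into one store so they can be queried together."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE loan (customer_id INTEGER, loan_amount REAL)")
    conn.execute("CREATE TABLE deposit (customer_id INTEGER, deposit_amount REAL)")
    conn.executemany(
        "INSERT INTO loan VALUES (:customer_id, :loan_amount)", LOAN_SYSTEM
    )
    conn.executemany(
        "INSERT INTO deposit VALUES (:customer_id, :deposit_amount)", DEPOSIT_SYSTEM
    )
    conn.commit()
    return conn


def customers_with_loans_and_deposits(conn):
    """Correlate the two systems: customers that appear in both."""
    query = """
        SELECT l.customer_id, l.loan_amount, d.deposit_amount
        FROM loan AS l
        JOIN deposit AS d ON d.customer_id = l.customer_id
        ORDER BY l.customer_id
    """
    return conn.execute(query).fetchall()
```

Calling `customers_with_loans_and_deposits(build_warehouse())` returns only customer 1, the one customer present in both systems.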

Features of a Data Warehouse

Integration

The data in a data warehouse comes from multiple data sources, each of which stores its original data in its own way. Before it can be integrated into the final data set, the data must go through a series of extraction, cleaning, and transformation steps.
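As a rough sketch of the cleaning and transformation step, suppose two sources record the same kind of loan data with different field names, date formats, and amount units. The sources `SOURCE_A` and `SOURCE_B`, their fields, and the target schema are all assumptions made up for this example.

```python
from datetime import datetime

# Hypothetical raw rows: same kind of record, different conventions per source.
SOURCE_A = [{"cust": "0001", "date": "2023-08-01", "amount": "5,000.50"}]
SOURCE_B = [{"customer_id": 2, "opened": "15/08/2023", "amount_cents": 1200000}]


def transform_a(row):
    """Source A: zero-padded string id, ISO date, amount as text with commas."""
    return {
        "customer_id": int(row["cust"]),
        "loan_date": datetime.strptime(row["date"], "%Y-%m-%d").date().isoformat(),
        "amount": float(row["amount"].replace(",", "")),
    }


def transform_b(row):
    """Source B: integer id, day/month/year date, amount in cents."""
    return {
        "customer_id": int(row["customer_id"]),
        "loan_date": datetime.strptime(row["opened"], "%d/%m/%Y").date().isoformat(),
        "amount": row["amount_cents"] / 100,
    }


def integrate():
    """Apply each source's transform and merge the rows into one standard schema."""
    rows = [transform_a(r) for r in SOURCE_A] + [transform_b(r) for r in SOURCE_B]
    return sorted(rows, key=lambda r: r["customer_id"])
```

After `integrate()`, every row has the same three fields with the same types, regardless of which source it came from.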

Stability

The data saved in the data warehouse consists of historical records and is not allowed to be modified. Users can only query and analyze it through analysis tools.

Dynamic

The data in the data warehouse is updated regularly over time. Regular updating here does not mean modifying existing data; rather, data that has changed in the business systems is periodically synchronized into the data warehouse, so this does not conflict with stability. Non-updatability is from the application's point of view: the data does not change while users analyze and process it.
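One common way to reconcile regular synchronization with stability is to append dated snapshots instead of overwriting rows. The sketch below is one possible approach, not necessarily the one any given warehouse uses; the `customer_snapshot` table, its columns, and the sample addresses are hypothetical.

```python
import sqlite3


def load_snapshot(conn, snapshot_date, source_rows):
    """Append the current state of the source; existing rows are never updated."""
    conn.executemany(
        "INSERT INTO customer_snapshot (snapshot_date, customer_id, address) "
        "VALUES (?, ?, ?)",
        [(snapshot_date, cid, addr) for cid, addr in source_rows],
    )
    conn.commit()


def address_as_of(conn, customer_id, as_of_date):
    """Return the address recorded in the latest snapshot on or before a date."""
    row = conn.execute(
        """
        SELECT address FROM customer_snapshot
        WHERE customer_id = ? AND snapshot_date <= ?
        ORDER BY snapshot_date DESC
        LIMIT 1
        """,
        (customer_id, as_of_date),
    ).fetchone()
    return row[0] if row else None


def demo_snapshots():
    """Simulate two periodic syncs in which the customer's address changed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_snapshot "
        "(snapshot_date TEXT, customer_id INTEGER, address TEXT)"
    )
    load_snapshot(conn, "2023-08-01", [(1, "Old Street 1")])
    load_snapshot(conn, "2023-09-01", [(1, "New Avenue 9")])
    return conn
```

Because the August row is still there after the September sync, `address_as_of(conn, 1, "2023-08-15")` can still answer with the old address even though the business system has since changed it.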

Thematic

A traditional database serves one particular business, whereas a data warehouse integrates data from different data sources as needed; that is, the data is generally modeled around a business theme, for example a "Loan" theme or a "Deposit" theme.

Data Warehouse Layering

Data warehouses are generally layered, and each company defines its layers according to its own business scenarios. Layering schemes therefore vary widely, and there is no standard answer. But the most mainstream approach is to layer them like this:

[Figure: the most mainstream data warehouse layering scheme]

The purpose of data warehouse layering

It reduces repeated development: an intermediate layer built during data development can hold common logic and avoid repeated calculations;

It gives a clear data structure: each layer has a clear division of labor, which is easy for developers to understand;

It makes problems easier to locate: layering exposes the lineage of the data, so when a problem occurs it can be traced back layer by layer;

It simplifies complex problems, in the spirit of divide and conquer: each layer handles one well-defined step.
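The first point, sinking common logic into an intermediate layer, can be illustrated with a small sketch. The layer names used here (raw, detail, report) and the sample data are assumptions for the example, not a standard scheme.

```python
from collections import defaultdict

# Hypothetical raw layer: data as it arrives from the source, uncleaned.
RAW_LOANS = [
    {"customer_id": "1", "branch": " North ", "amount": "5000"},
    {"customer_id": "2", "branch": "north", "amount": "12000"},
    {"customer_id": "3", "branch": "South", "amount": "700"},
]


def detail_layer(raw_rows):
    """Intermediate layer: clean once so every report shares the same logic."""
    return [
        {
            "customer_id": int(r["customer_id"]),
            "branch": r["branch"].strip().lower(),
            "amount": float(r["amount"]),
        }
        for r in raw_rows
    ]


def total_by_branch(detail_rows):
    """Report 1: total loan amount per branch, built on the detail layer."""
    totals = defaultdict(float)
    for r in detail_rows:
        totals[r["branch"]] += r["amount"]
    return dict(totals)


def large_loans(detail_rows, threshold=1000.0):
    """Report 2: customers with large loans, reusing the same detail layer."""
    return [r["customer_id"] for r in detail_rows if r["amount"] >= threshold]
```

Both reports read from `detail_layer(RAW_LOANS)`, so the cleaning rules live in one place; if a report looks wrong, the detail layer is the first place to check before going back to the raw data.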


Origin blog.csdn.net/weixin_44958787/article/details/132601881