Essentials | The data warehouse knowledge you want to know is all here!

A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data used to support management decision-making.

  • Subject-oriented: a data warehouse serves data analysis, so its data is organized and stored around specific subjects.
  • Integrated: data scattered across the original source databases is systematically processed and organized, eliminating inconsistencies among the sources.
  • Non-volatile (stable): once data enters the warehouse, it only needs to be loaded and refreshed periodically; it is not modified frequently.
  • Time-variant (reflects historical change): to support decision-making, data in the warehouse must carry time attributes. With this information, the enterprise's development and future trends can be analyzed and predicted quantitatively.

——The difference between a database and a data warehouse

Databases and data warehouses are actually quite similar: both use a database management system to organize, store, and manage data. Their differences are:

A database is a collection of raw data; it mainly stores the transactional data produced by business processes, such as bank transactions and order records. A data warehouse is an evolution of the database concept: it stores processed data, mainly data integrated and summarized from databases, used for analyzing historical data around particular subjects, with a focus on decision support.

These concepts are a bit abstract on their own. Any technology serves an application, and the two are easy to understand once combined with one. Take banking as an example. The database is the data platform of the bank's transaction system: every transaction a customer makes at the bank is written into the database and recorded. The data warehouse is the data platform of the analysis system: it takes data from the transaction database, summarizes and processes it, and provides decision-makers with a basis for decisions. For example: how many transactions does a certain branch handle in a month, and what is that branch's current deposit balance? If deposits and consumer transactions are both high, it makes sense to install an ATM in that area.

Clearly, a bank's transaction volume is huge, usually millions or even tens of millions of transactions. The transaction system demands timeliness: a customer cannot be expected to wait tens of seconds for a payment, so the database must respond in real time. The analysis system works after the fact; it must provide all the valid data in the period of interest. That data is massive and the summary computations are slower, but as long as it delivers useful analysis results, it achieves its goal.

The difference between a database and a data warehouse is actually the difference between OLTP and OLAP.

  • Operational processing, known as OLTP (On-Line Transaction Processing), also called transaction-oriented processing, covers the day-to-day online operations of a specific business against a database, usually querying or modifying just a few records. Users care mainly about response time, data security, integrity, and the number of concurrent users supported. As the main means of data management, traditional databases are used primarily for operational processing.
  • Analytical processing, known as OLAP (On-Line Analytical Processing), supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results.
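The contrast can be sketched with a small in-memory example using Python's built-in sqlite3. The `orders` table and its columns are hypothetical, purely for illustration: the OLTP-style query touches one record by key, while the OLAP-style query scans and aggregates the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, amount REAL, ts TEXT)")
# Hypothetical sample data: 3 users placing one order per day for 10 days
conn.executemany(
    "INSERT INTO orders (user_id, amount, ts) VALUES (?, ?, ?)",
    [(u, 10.0 * u, f"2021-01-{d:02d}") for d in range(1, 11) for u in (1, 2, 3)],
)

# OLTP style: look up or modify a few records by key, must return instantly
row = conn.execute("SELECT amount FROM orders WHERE id = ?", (1,)).fetchone()

# OLAP style: scan and aggregate many records for decision support
daily = conn.execute(
    "SELECT ts, COUNT(*), SUM(amount) FROM orders GROUP BY ts ORDER BY ts"
).fetchall()
```

In a real system the two workloads would run on different platforms; a single database is used here only to show the shape of each query.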

Basically every enterprise goes through the evolution from databases to a data warehouse. Take the e-commerce industry as an example:

  • In the early days of e-commerce, the barrier to entry was low: find an outsourcing team, put together a web front end, a few servers, and a MySQL instance, and open for business.
  • In the second stage, traffic arrives and customers and orders keep growing. The architecture must be upgraded to multiple servers and multiple business databases (distributed storage). At this stage, business data and metrics can still, just barely, be queried directly from the business databases.

  • In the third stage, as the business develops, the data volume grows exponentially and the business questions become more and more complex. Leaders' concerns evolve from very broad questions at the start ("What was yesterday's revenue?", "What were last month's PV and UV?") to very refined, specific consumer-behavior analysis, such as "How did purchases of cosmetics by 20-to-30-year-old female users perform during promotions in the first quarter of each of the past five years?" This kind of very specific data, which can play a key role in company decisions, is hard to retrieve from the business databases, for two reasons: 1. the data structures in a business database are designed to complete transactions, not for convenient querying and analysis; 2. most business databases are optimized for mixed reads and writes, since a purchase must both read (look up product information) and write (generate the order, complete the payment), so support for large-scale reads (complex analytical queries) is insufficient. A data warehouse is built to solve exactly these problems. Its characteristics are: 1. data structures designed for convenient analysis and querying; 2. read-optimized storage: writes do not need to be fast, as long as complex queries over large amounts of data are fast enough.

——ETL

Data in a data warehouse is usually extracted from multiple data sources, then integrated and summarized into the warehouse's historical records. The various sources (internal business databases, external files, crawlers, third-party APIs, etc.) store data in different ways, so the data must be extracted, cleaned, and converted. This process of moving data from sources into the warehouse is ETL (Extract-Transform-Load):

  • Extract: read data from the multiple data sources
  • Transform: convert the extracted data into a unified format
  • Load: write the processed data into the data warehouse
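The three steps can be sketched end to end. This is a minimal, illustrative pipeline (the source formats, table names, and field names below are all hypothetical): it extracts records from a CSV export and an operational table, transforms them into one schema with normalized dates, and loads them into a warehouse table.

```python
import csv
import io
import sqlite3

# Hypothetical source A: a CSV export with dates formatted as DD/MM/YYYY
source_a = io.StringIO("user,amount,date\nalice,12.5,03/01/2021\nbob,7.0,04/01/2021\n")
# Hypothetical source B: an operational table with ISO dates
src_db = sqlite3.connect(":memory:")
src_db.execute("CREATE TABLE payments (user TEXT, amount REAL, date TEXT)")
src_db.execute("INSERT INTO payments VALUES ('carol', 3.25, '2021-01-05')")

def extract():
    # Extract: read raw records from each source
    yield from csv.DictReader(source_a)
    for user, amount, date in src_db.execute("SELECT user, amount, date FROM payments"):
        yield {"user": user, "amount": amount, "date": date}

def transform(rec):
    # Transform: unify types and normalize dates to YYYY-MM-DD
    d = rec["date"]
    if "/" in d:
        day, month, year = d.split("/")
        d = f"{year}-{month}-{day}"
    return (rec["user"], float(rec["amount"]), d)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_payments (user TEXT, amount REAL, date TEXT)")
# Load: write the cleaned, unified rows into the warehouse table
warehouse.executemany("INSERT INTO fact_payments VALUES (?, ?, ?)",
                      (transform(r) for r in extract()))
rows = warehouse.execute(
    "SELECT user, amount, date FROM fact_payments ORDER BY date").fetchall()
```

Real ETL tools add scheduling, error handling, and incremental loading on top of this same extract/transform/load skeleton.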

Commonly used ETL tools: DataStage, Informatica, Kettle

——Hierarchical storage in the data warehouse

Generally speaking, a data warehouse is divided into at least three layers: ODS (Operational Data Store), DSA (Data Staging Area), and EDW (Enterprise Data Warehouse). Of course, the layer names may differ from company to company; the point here is just to distinguish their functions.

  • The ODS layer holds the new or changed data from the business databases within a time window. Its storage grows linearly, since ODS stores data only when it changes; it is effectively a copy of the business databases.
  • The DSA layer holds data extracted from the ODS layer after cleaning and conversion.
  • The EDW layer merges and abstracts the business models of the DSA layer, simplifying redundant tables so that data is easier to extract for analysis.
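The three layers can be illustrated with plain Python data. The records, field names, and the specific cleaning and summarizing rules below are hypothetical; the point is only the flow from ODS (raw copy) to DSA (cleaned) to EDW (subject-oriented summary).

```python
# ODS: raw copies of operational rows exactly as extracted (strings, duplicates and all)
ods = [
    {"order_id": 1, "user": " Alice ", "amount": "12.50", "ts": "2021-01-03"},
    {"order_id": 1, "user": " Alice ", "amount": "12.50", "ts": "2021-01-03"},  # duplicate load
    {"order_id": 2, "user": "Bob",     "amount": "7.00",  "ts": "2021-01-04"},
]

# DSA: the same rows after cleaning, deduplication, and type conversion
dsa_by_id = {}
for rec in ods:
    dsa_by_id[rec["order_id"]] = {
        "order_id": rec["order_id"],
        "user": rec["user"].strip(),
        "amount": float(rec["amount"]),
        "ts": rec["ts"],
    }
dsa = list(dsa_by_id.values())

# EDW: subject-oriented summary merged from DSA, ready for analytical queries
edw_by_day = {}
for rec in dsa:
    day = edw_by_day.setdefault(rec["ts"], {"ts": rec["ts"], "orders": 0, "revenue": 0.0})
    day["orders"] += 1
    day["revenue"] += rec["amount"]
edw = sorted(edw_by_day.values(), key=lambda d: d["ts"])
```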

The input of a data warehouse is a variety of data sources; its final output serves the enterprise's data analysis, data mining, and reporting.

——Commonly used data warehouses

Hive is a Hadoop-based data warehouse tool that can query, analyze, and process data sets stored as files on HDFS. Hive provides HiveQL, a SQL-like query language; when a query runs, Hive converts the HiveQL statement into MapReduce tasks and executes them on the Hadoop layer.

HDFS is Hadoop's distributed file system and serves as the storage layer of the data warehouse. The DataNodes shown in the figure are HDFS's many worker nodes.

MapReduce is a parallel computing model for massive data; it can be roughly understood as converting many data fragments in parallel and then merging the results.
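That convert-and-merge idea can be sketched in a few lines. Continuing the earlier bank example with hypothetical (branch, amount) records: the map phase emits key-value pairs from each data split, a shuffle groups values by key, and the reduce phase merges each group. This is a toy single-process model of the programming model only, not how Hadoop actually distributes work.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input splits: fragments of (branch, amount) transaction records
splits = [
    [("east", 100.0), ("west", 50.0)],
    [("east", 25.0)],
    [("west", 75.0), ("east", 10.0)],
]

def map_phase(split):
    # Map: emit a (key, value) pair for each record in one split
    return [(branch, amount) for branch, amount in split]

def shuffle(pairs):
    # Shuffle: group all emitted values by key across the map outputs
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: merge the grouped values for one key into a single result
    return key, sum(values)

mapped = chain.from_iterable(map_phase(s) for s in splits)
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Hive generates jobs of exactly this shape from HiveQL: a GROUP BY with SUM, for instance, becomes map, shuffle, and reduce stages over the file blocks on HDFS.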

The Teradata data warehouse ships with a high-performance, highly reliable massively parallel processing (MPP) platform that can process massive amounts of data at high speed; its performance is far higher than Hive's.

 


Origin blog.csdn.net/yoggieCDA/article/details/109765795