Big Data Offline Phase 05: Data Warehouse, Hive

Basic concepts of data warehouse

A data warehouse (English: Data Warehouse, abbreviated DW or DWH) exists to build an analysis-oriented, integrated data environment that provides decision support for the enterprise. It was created for analytical reporting and decision-making purposes.

The data warehouse itself neither "produces" nor "consumes" any data: data flows in from external sources and is served out to external applications. That is why it is called a "warehouse" rather than a "factory".

Key Features of a Data Warehouse

A data warehouse is a subject-oriented (Subject-Oriented), integrated (Integrated), non-volatile (Non-Volatile), and time-variant (Time-Variant) collection of data that supports management decisions.

Subject-Oriented

The defining feature of a traditional database is that its data is organized around applications, and the various business systems may be isolated from one another. A data warehouse, by contrast, is organized around subjects. A subject is an abstract concept: a higher-level abstraction for synthesizing, classifying, analyzing, and using the data in an enterprise's information systems. Logically, it corresponds to the object of analysis in some macro-level analysis domain of the enterprise.

Integrated

The data in a data warehouse is obtained by extracting, cleaning, transforming, and summarizing data from scattered, independent, heterogeneous source databases. This guarantees that the data in the warehouse is consistent across the entire enterprise.

The comprehensive data in a data warehouse cannot be obtained directly from the original database systems. Before data enters the warehouse, it must therefore be unified and synthesized. This is the most critical and complex step in building a data warehouse, and its tasks include:

(1) Resolving all contradictions in the source data, such as fields with the same name but different meanings, fields with different names but the same meaning, inconsistent units, inconsistent field lengths, and so on.

(2) Performing data synthesis and computation. Some of this synthesis can be done while data is being extracted from the source databases, but much of it is generated inside the data warehouse itself, that is, after the data has entered the warehouse.
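As a hedged sketch of the unification step (all table and column names here are hypothetical), in Hive such a step could be a single query that renames fields and converts units while merging two heterogeneous sources into one integrated table:

```sql
-- Hypothetical example: two source systems record order amounts under
-- different column names and in different units (yuan vs. fen).
-- The query renames fields and converts units into one integrated table.
INSERT INTO TABLE dw_orders
SELECT t.order_id, t.amount, t.src
FROM (
  SELECT order_id, amount_yuan AS amount, 'crm' AS src
  FROM crm_orders
  UNION ALL
  SELECT id AS order_id, fee_fen / 100.0 AS amount, 'erp' AS src
  FROM erp_orders
) t;
```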


Non-Volatile (Non-Updatable)

An operational database serves day-to-day business operations, so it must update data continuously and in real time, so that the latest data is always available and normal operations are not affected. A data warehouse, on the other hand, only needs to retain past business data; it does not have to be updated in real time for every transaction. Instead, a batch of newer data is imported into the warehouse at intervals, according to business needs.
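In Hive, this batch-import pattern is typically implemented with date-partitioned tables; a minimal sketch (table and column names are illustrative):

```sql
-- Each batch lands in its own date partition;
-- existing history is never updated in place.
CREATE TABLE IF NOT EXISTS dw_sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (dt STRING);

-- Periodic load: overwrite only the partition for that day's batch.
INSERT OVERWRITE TABLE dw_sales PARTITION (dt = '2023-08-01')
SELECT order_id, amount
FROM staging_sales
WHERE sale_date = '2023-08-01';
```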

Time-Variant

A data warehouse contains historical data at various granularities; its data may relate to a particular date, week, month, quarter, or year. The purpose of the warehouse is to discover hidden patterns by analyzing how the business has operated over a past period. Although users cannot modify the data, this does not mean the data in the warehouse never changes. Analysis results can only reflect the past; when the business changes, the mined patterns lose their timeliness. The data in the warehouse therefore needs to be refreshed to keep meeting decision-making needs. From this perspective, building a data warehouse is both a project and an ongoing process.

The difference between data warehouse and database

The difference between database and data warehouse is actually the difference between OLTP and OLAP.

Operational processing, known as OLTP (On-Line Transaction Processing), is also called a transaction-oriented processing system. It consists of the day-to-day operations that specific business functions perform on the database, usually querying or modifying a small number of records. Users care most about response time, data security, data integrity, and the number of concurrent users supported. Traditional database systems, as the main means of data management, are used chiefly for operational processing.

Analytical processing, known as OLAP (On-Line Analytical Processing), generally analyzes historical data on certain subjects to support management decisions.
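The contrast can be illustrated with two queries (schemas are hypothetical): an OLTP operation touches a single record with low latency, while an OLAP query scans and aggregates a large slice of history:

```sql
-- OLTP (traditional database): read or modify one record,
-- response time matters.
UPDATE accounts SET balance = balance - 100 WHERE account_id = 42;

-- OLAP (data warehouse): scan a year of history and aggregate
-- by an analysis subject.
SELECT region, SUM(amount) AS total_sales
FROM dw_sales
WHERE dt BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY region;
```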

Introduction to Hive

What is Hive

Hive is a Hadoop-based data warehouse tool that maps structured data files to database tables and provides SQL-like query capabilities.

In essence, it converts SQL into MapReduce programs.

Main purpose: It is used for offline data analysis, which is more efficient than developing directly with MapReduce.
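A minimal sketch of that file-to-table mapping (the HDFS path and schema are illustrative): a delimited file already sitting on HDFS becomes a queryable table, and a SQL-like query over it is compiled into a MapReduce job:

```sql
-- Map a tab-delimited file on HDFS onto a table; the file itself
-- is not moved or converted, only described by metadata.
CREATE EXTERNAL TABLE page_views (
  user_id  BIGINT,
  page_url STRING,
  ts       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- This SQL-like query is compiled and run as a MapReduce job.
SELECT page_url, COUNT(*) AS pv
FROM page_views
GROUP BY page_url;
```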

Why use Hive

Problems faced by directly using Hadoop MapReduce to process data:

The learning cost for personnel is too high

Implementing complex query logic in MapReduce is difficult and slow to develop

Using Hive:

The interface uses SQL-like syntax, enabling rapid development

Avoids writing MapReduce directly, reducing developers' learning costs

Functionality is easy to extend

Hive Architecture

Hive architecture diagram

Hive components

User interface : includes the CLI, JDBC/ODBC, and a WebGUI. The CLI (command-line interface) is the Hive shell; JDBC/ODBC is Hive's Java client interface, similar to a traditional database's JDBC; the WebGUI accesses Hive through a browser.

Metadata storage : usually kept in a relational database such as MySQL or Derby. The metadata in Hive includes table names, the columns and partitions of each table and their attributes, table attributes (whether it is an external table, etc.), the directory where each table's data is stored, and so on.

Interpreter, compiler, optimizer, and executor : together they take an HQL query through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce.
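To inspect the query plan that the compiler and optimizer produce, Hive provides the EXPLAIN statement (the table name here is illustrative):

```sql
-- Prints the plan's stage graph (e.g. a map-reduce stage followed
-- by a fetch stage) instead of executing the query.
EXPLAIN
SELECT page_url, COUNT(*)
FROM page_views
GROUP BY page_url;
```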

The relationship between Hive and Hadoop

Hive uses HDFS to store data and uses MapReduce to query and analyze data.
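How the two halves fit together can be sketched as follows (the file path and table name are hypothetical): the data lives in an HDFS directory, and Hive's metadata records where:

```sql
-- Move a file from an HDFS staging path into the table's data
-- directory (by default under /user/hive/warehouse for managed tables).
LOAD DATA INPATH '/staging/page_views.tsv' INTO TABLE page_views;

-- Shows, among other metadata, the HDFS Location of the table's data.
DESCRIBE FORMATTED page_views;
```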

Comparison between Hive and traditional databases

Hive is used for offline data analysis of massive data.

Hive has the appearance of a SQL database, but the application scenarios are completely different: Hive is only suitable for statistical analysis of batch data.

For a more intuitive comparison: Hive uses HQL (a SQL-like language), stores data in HDFS, executes queries as MapReduce jobs, has high latency, and scales to very large data sets; a traditional database uses standard SQL, stores data on the local file system or raw devices, runs queries in its own executor, offers low latency and record-level updates, and handles comparatively small data volumes.

Origin blog.csdn.net/Blue92120/article/details/132467198