A brief introduction to the data warehouse Hive (1)


A data warehouse (DW or DWH) is a subject-oriented, integrated, time-variant, and relatively stable (non-volatile) collection of data.

Three characteristics of a data warehouse (a common multiple-choice question):
  • Subject-oriented
  • Time-variant (changes over time)
  • Relatively stable (non-volatile)
The main differences between a database and a data warehouse:
  • A database stores only current values; a data warehouse stores historical values.
  • Data in a database changes dynamically and is updated whenever a business transaction occurs; data in a data warehouse is static, historical data that is only appended to and refreshed periodically.
  • The data structures in a database are relatively complex, with various structures to suit the needs of the business processing systems; the data structures in a data warehouse are relatively simple.
  • A database is accessed frequently, but each access touches little data; a data warehouse is accessed less frequently, but each access touches a large amount of data.
  • A database serves business operations staff, supporting their day-to-day transaction processing; a data warehouse serves senior management, supporting their decision making.
  • A database must respond quickly when data is accessed, generally within a few seconds; a data warehouse query may take up to several hours.
Two types of data processing:
  • Online transaction processing (OLTP): the main application of traditional relational databases, focusing on transaction processing
  • Online analytical processing (OLAP): the main application of data warehouse systems, complex analytical operations, focusing on decision support
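
To make the contrast concrete, here is a minimal sketch (the orders tables and columns are hypothetical): an OLTP system answers small, current-state lookups, while an OLAP query scans and aggregates historical data for decision support.

  -- OLTP-style query: fetch the current state of one order (small, fast, frequent)
  SELECT order_status
  FROM   orders
  WHERE  order_id = 10023;

  -- OLAP-style query: aggregate years of history (large scan, infrequent)
  SELECT region,
         year(order_date)  AS order_year,
         SUM(order_amount) AS total_amount
  FROM   orders_history
  GROUP BY region, year(order_date);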


Data warehouse structure:
  • Data sources: the foundation of the data warehouse, i.e., the system's sources of data; they usually include various internal and external information of the enterprise.
    • Operational databases
    • Documents
    • Other external sources
  • Data storage and management: the core of the entire data warehouse. It determines how external data is represented; it extracts, cleans, and effectively integrates the system's existing data, then organizes it by subject (a HiveQL sketch of this ETL step follows this list).
    • Extract
    • Transform
    • Load
    • Data warehouse -> data mart
  • OLAP server: reorganizes the data to be analyzed according to a multidimensional data model, so that users can analyze it from multiple angles and at multiple levels at any time and discover patterns and trends in the data.
    • Server
  • Front-end tools: mainly data analysis tools, reporting tools, query tools, data mining tools, and various applications built on top of the data warehouse or data marts.
    • Data query
    • Data reports
    • Data analysis
    • Various applications
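
As a rough illustration of the extract-transform-load step in HiveQL (the file path and table names here are made up for the example): raw operational data is loaded into a staging table, then cleaned and reorganized by subject into a warehouse table.

  -- Extract: load raw exported data (hypothetical file) into a staging table
  LOAD DATA INPATH '/data/export/orders_raw.csv' INTO TABLE stg_orders;

  -- Transform + Load: clean the data and organize it by subject
  INSERT OVERWRITE TABLE dw_orders
  SELECT order_id,
         customer_id,
         to_date(order_time)           AS order_date,
         CAST(amount AS DECIMAL(10,2)) AS order_amount
  FROM   stg_orders
  WHERE  order_id IS NOT NULL;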


The data model of the data warehouse:
  • Star model (commonly used): a combination of one fact table and a set of dimension tables, with the fact table at the center. All dimension tables connect directly to the fact table, but the dimension tables are not connected to each other (a HiveQL sketch follows this list).
    • Fact table: stores the measures of the analysis subject; it holds the foreign keys that link to the dimension tables, and its records keep growing.
    • Dimension table: records descriptive or summarized data for a dimension (a dimension is an angle from which to analyze the data).
  • Snowflake model: one or more dimension tables are not connected to the fact table directly, but reach it through other dimension tables.
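A minimal star-schema sketch in HiveQL (the sales example and all table and column names are hypothetical): the fact table holds the measures plus foreign keys, and each dimension table connects directly to it.

  -- Dimension tables: each describes one angle of analysis
  CREATE TABLE dim_date    (date_key INT, full_date STRING, sale_year INT, sale_month INT);
  CREATE TABLE dim_product (product_key INT, product_name STRING, category STRING);

  -- Fact table: foreign keys to the dimensions plus the measures; rows keep growing
  CREATE TABLE fact_sales (
      date_key     INT,
      product_key  INT,
      sales_amount DOUBLE,
      quantity     INT
  );

  -- A typical star join: analyze the facts from two dimensions at once
  SELECT d.sale_year, p.category, SUM(f.sales_amount) AS total_sales
  FROM   fact_sales f
  JOIN   dim_date    d ON f.date_key    = d.date_key
  JOIN   dim_product p ON f.product_key = p.product_key
  GROUP BY d.sale_year, p.category;
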
Hive:
  • It originated at Facebook.
  • It is a data warehouse platform built on top of the Hadoop file system. It provides a set of tools for extracting, transforming, and loading (ETL) data stored in HDFS, i.e., for storing, querying, and analyzing large-scale data. At its core, Hive is a SQL parsing engine that translates SQL statements into MapReduce jobs and executes them on Hadoop. (HiveQL is not case sensitive.)
  • Hive uses an SQL-like query language, HQL (HiveQL).
The difference between Hive and MySQL:
Comparison               Hive                 MySQL
Query language           HQL (HiveQL)         SQL
Data storage location    HDFS                 Block device / local file system
Data format              User-defined         Determined by the system
Data updates             Not supported        Supported
Transactions             Not supported        Supported
Execution latency        High                 Low
Scalability              High                 Low
Data scale               Large                Small
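
For example, because row-level updates are not supported in classic Hive (without ACID tables), changed data is typically rewritten in bulk rather than updated in place; the sketch below assumes the hypothetical dw_orders table from earlier.

  -- MySQL style (not available in classic Hive):
  --   UPDATE dw_orders SET order_status = 'shipped' WHERE order_id = 10023;

  -- Hive style: rewrite the table (or a partition) with the corrected data
  INSERT OVERWRITE TABLE dw_orders
  SELECT order_id,
         customer_id,
         CASE WHEN order_id = 10023 THEN 'shipped' ELSE order_status END AS order_status
  FROM   dw_orders;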
The Hive system framework consists of:
  • User interface (CLI, JDBC/ODBC, WebUI)
  • Cross-language service (Thrift Server): lets Hive be called from different programming languages
  • Underlying driver engine (compiler, optimizer, executor)
  • Metastore: metadata such as table names, columns, and partitions, stored in Derby or MySQL
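
The metadata kept in the metastore can be inspected directly from the CLI; for instance (assuming a table named dw_orders already exists):

  SHOW DATABASES;                 -- databases registered in the metastore
  SHOW TABLES;                    -- tables in the current database
  DESCRIBE FORMATTED dw_orders;   -- columns, storage format, and the table's HDFS location
  SHOW PARTITIONS dw_orders;      -- partition values (only valid for a partitioned table)
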
Hive operating mechanism:
  • Users connect to Hive through the user interface and submit Hive SQL
  • Hive parses the query and builds a query plan
  • Hive converts the query into MapReduce jobs
  • Hive runs the MapReduce jobs on Hadoop (this lowers the barrier for analysts to use Hadoop for data analysis)
How Hive works:

1: The UI sends the query to the Driver;
2: The Driver asks the Compiler to parse the query and produce a query plan;
3: The Compiler sends a metadata request to the Metastore;
4: The Metastore returns the metadata to the Compiler;
5: The Compiler checks the requirements and sends the completed plan back to the Driver;
6: The Driver (driver engine) sends the execution plan to the execution engine to run the job;
7: The execution engine fetches the results from the DataNodes and sends them back through the Driver to the UI.
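
The plan produced in steps 2-5 can be inspected with EXPLAIN, which shows how Hive breaks a query into stages (MapReduce stages in classic Hive) before anything runs; the query below is just an illustrative example.

  EXPLAIN
  SELECT customer_id, COUNT(*) AS order_cnt
  FROM   dw_orders
  GROUP BY customer_id;
  -- The output lists the stage plan, e.g. a map-reduce stage with its
  -- map operator tree and reduce operator tree, without executing the query.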

Hive data model

All data in Hive is stored in HDFS and is organized into four levels (from coarse to fine granularity):

  • Database: equivalent to a namespace in a relational database; its role is to isolate users and database applications into different databases or schemas.
  • Table: a Hive table logically consists of the stored data plus the metadata that describes the table's layout. The data itself is stored in the distributed file system. Hive has two kinds of tables: internal (managed) tables, whose data is stored inside the Hive warehouse directory, and external tables, whose data can be stored in the distributed file system outside the Hive warehouse directory or inside it. The Hive warehouse itself is a directory in HDFS; it is the default storage path for Hive data, can be configured in the Hive configuration file, and its location is ultimately recorded in the metastore.
  • Partition: in storage, a partition is a subdirectory under the table's main directory (a Hive table itself appears as a folder); the subdirectory is named after the partition column we define. Partitions are designed to speed up queries.
  • Bucket: splits a large table (or partition) into smaller pieces. Organizing tables or partitions into buckets mainly yields higher query efficiency and makes sampling queries more convenient. The bucket is the smallest unit of the Hive data model. When data is loaded into a bucketed table, the value of the bucketing column is hashed and taken modulo the number of buckets to decide which bucket each row goes into. Physically, each bucket is a file under the table or partition directory.
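
A short HiveQL sketch of the four levels (all names are hypothetical; actual paths depend on the warehouse configuration):

  -- Database: a namespace, stored as a directory under the warehouse path
  CREATE DATABASE IF NOT EXISTS sales_dw;
  USE sales_dw;

  -- Internal (managed) table: data lives inside the Hive warehouse directory
  CREATE TABLE orders_managed (order_id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

  -- External table: data stays at a location outside (or inside) the warehouse
  CREATE EXTERNAL TABLE orders_external (order_id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/external/orders';

  -- Partitioned table: each partition value becomes a subdirectory of the table
  CREATE TABLE orders_by_day (order_id INT, amount DOUBLE)
  PARTITIONED BY (order_date STRING);

  -- Bucketed table: rows are hashed on user_id modulo 4 into 4 bucket files
  CREATE TABLE orders_bucketed (order_id INT, user_id INT, amount DOUBLE)
  CLUSTERED BY (user_id) INTO 4 BUCKETS;
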
Origin blog.csdn.net/id__39/article/details/105526913