6 Hive Overview

1. What is Hive

Hive is a data warehouse platform built on top of Hadoop. Its SQL engine parses SQL statements and translates them into MapReduce jobs, which then run on Hadoop. A Hive table is a directory of files on HDFS: each table corresponds to a directory name, and if the table is partitioned, each partition value corresponds to a subdirectory.
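As a rough illustration of the table-to-directory mapping described above, the following Python sketch builds the HDFS path for a table and for one of its partitions. The warehouse root, the table name logs, and the partition column dt are hypothetical examples; the column=value naming of the subdirectory follows Hive's usual convention.

```python
from posixpath import join

# Hypothetical warehouse root; the real value comes from the Hive configuration.
WAREHOUSE = "/apps/hive/warehouse"

def table_dir(table, warehouse=WAREHOUSE):
    """Directory that holds the table's data: one directory per table."""
    return join(warehouse, table)

def partition_dir(table, partitions, warehouse=WAREHOUSE):
    """Subdirectory for a partition: one 'column=value' level per partition column."""
    parts = [f"{col}={val}" for col, val in partitions]
    return join(table_dir(table, warehouse), *parts)

print(table_dir("logs"))
print(partition_dir("logs", [("dt", "2020-03-27")]))
```

An unpartitioned table simply has no subdirectory levels, so partition_dir("logs", []) returns the table directory itself.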

2. Hive architecture

[Figure: Hive architecture diagram]

The components in the figure are explained below:

2.1 Parser

1) Compiler: takes an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and generation of an execution plan.

2) Optimizer: an evolving component that applies further transformations to the plan.

3) Executor: executes all the jobs in order. If the tasks in the chain have no dependencies on one another, the jobs can be executed concurrently.
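The executor's choice between sequential and concurrent execution can be pictured with a small Python sketch. This is not Hive's actual scheduler; it only groups hypothetical jobs into "waves", where every job in a wave has no unfinished dependency and could therefore run concurrently with the others in that wave. It assumes Python 3.9 or later for the standard graphlib module.

```python
from graphlib import TopologicalSorter

def execution_waves(deps):
    """Group jobs into waves; jobs in the same wave do not depend on each other.

    deps maps each job name to the set of jobs it must wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Hypothetical plan: job3 needs job1 and job2, which are independent of each other.
plan = {"job1": set(), "job2": set(), "job3": {"job1", "job2"}}
print(execution_waves(plan))
```

A strictly dependent chain degenerates into one job per wave, which corresponds to plain sequential execution.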

2.2 Metastore

The metastore holds the basic information about Hive databases. It lives in a relational database such as MySQL. Its contents include: database information, table names, the columns and partitions of each table together with their properties, table properties, and the directory where each table's data is located.

3. Managed tables and external tables

Hive tables come in two types: managed tables (internal tables) and external tables.

A managed table is used only within Hive, whereas an external table is also used outside Hive. The two differ in the following two ways:

3.1 Data storage

Managed table: the data is stored under the configured Hive warehouse directory. In my setup this is /apps/hive/warehouse.

External table: the data may reside in any directory on HDFS.

3.2 Data deletion

Managed table: dropping the table removes both the metadata and the data.

External table: dropping the table deletes only the metadata.
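The difference in deletion behavior can be mimicked with a toy model in Python. The metastore here is just a dictionary and the tables are temporary local folders, all of them hypothetical stand-ins rather than Hive's real implementation.

```python
import shutil
import tempfile
from pathlib import Path

def drop_table(metastore, name):
    """Toy model of dropping a table; not Hive's real implementation."""
    entry = metastore.pop(name)           # the metadata is removed in both cases
    if entry["managed"]:
        shutil.rmtree(entry["location"])  # managed table: the data goes too
    return entry

# Create one folder per table, each holding a small data file.
root = Path(tempfile.mkdtemp())
managed_dir = root / "warehouse" / "m"
external_dir = root / "elsewhere" / "e"
for d in (managed_dir, external_dir):
    d.mkdir(parents=True)
    (d / "data.txt").write_text("row\n")

metastore = {
    "m": {"managed": True, "location": managed_dir},
    "e": {"managed": False, "location": external_dir},
}
drop_table(metastore, "m")
drop_table(metastore, "e")
print(managed_dir.exists(), external_dir.exists())
```

After both drops the toy metastore is empty, the managed folder is gone, and the external folder with its data file is still in place.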

4. Partitions and buckets

4.1 Partition

A partition is really just a subfolder under a larger folder on HDFS; it is not a structure layered on the table itself. Partitions help narrow the scope of a query and so improve efficiency. When data is imported, each row is placed into the designated partition, i.e. the designated folder, according to its value in a particular column.
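A minimal Python sketch of this idea, with in-memory lists standing in for HDFS folders: rows are routed to a folder by the value of the partition column, and a query on that column reads only the matching folder. The column name dt and the sample rows are made up for illustration.

```python
from collections import defaultdict

def load_into_partitions(rows, column):
    """Place each row in the folder named after its value in the partition column."""
    folders = defaultdict(list)
    for row in rows:
        folders[f"{column}={row[column]}"].append(row)
    return dict(folders)

def query(folders, column, value):
    """Read only the matching folder instead of scanning every row."""
    return folders.get(f"{column}={value}", [])

rows = [
    {"dt": "2020-03-26", "user": "a"},
    {"dt": "2020-03-27", "user": "b"},
    {"dt": "2020-03-27", "user": "c"},
]
folders = load_into_partitions(rows, "dt")
print(sorted(folders))
print(query(folders, "dt", "2020-03-27"))
```

The query touches one folder out of two here; with many partitions the saving grows accordingly, which is the efficiency gain the text refers to.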

4.2 Buckets

Bucketing is a structure layered on the table, and it can improve query efficiency. Buckets correspond to MapReduce output file partitions: a job produces as many buckets as it has reduce tasks.

My personal understanding is this: a table corresponds to a folder, a partition is a subfolder under that folder, and a bucket corresponds to a file inside that subfolder. Content that shares the same characteristics goes into the same file, that is, into one bucket.
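To make "content with the same characteristics goes into one bucket" concrete, here is a small Python sketch. It assumes an integer bucketing column and assigns each row to the bucket given by the value modulo the number of buckets; Hive's real hash depends on the column type, and the bucket_N labels are hypothetical.

```python
def bucket_for(value, num_buckets):
    """Bucket index for an integer key: rows with the same key land in the same bucket."""
    return value % num_buckets

def bucketize(rows, column, num_buckets):
    """Split rows into num_buckets lists, one per bucket file."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets

rows = [{"id": i} for i in range(6)]
for index, bucket in enumerate(bucketize(rows, "id", 4)):
    print(f"bucket_{index}", [r["id"] for r in bucket])
```

With four buckets, ids 0 and 4 share a bucket and ids 1 and 5 share another, so a lookup by id only needs to open one file out of four.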

Source: blog.csdn.net/stable_zl/article/details/105133252