First, what is Hive
Hive is a data warehouse platform built on top of Hadoop. Its SQL engine parses SQL statements and translates them into MapReduce jobs, which then run on Hadoop. A Hive table is a directory of files in HDFS: each table corresponds to a directory name, and if the table is partitioned, each partition value corresponds to a subdirectory.
Second, Hive architecture
The elements of the architecture figure are explained below:
1, The parser
1) Compiler: takes an HQL statement through lexical analysis, parsing, compilation, and optimization, and generates an execution plan.
2) Optimizer: an evolving component.
3) Executor: executes all jobs in order; if there are no dependencies within the task chain, the jobs may be executed concurrently.
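The executor's behavior described above (run jobs in order, but allow jobs with no dependencies between them to run concurrently) can be illustrated with a small sketch. This is not Hive's scheduler; it is a toy model using Python's standard graphlib module (Python 3.9 or later), and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group jobs into waves. Every job in a wave has all of its
    dependencies finished, so the jobs of one wave could run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical job graph: each key depends on the jobs in its value set.
jobs = {
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}
waves = execution_waves(jobs)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the two scan jobs land in the same wave because neither depends on the other, while a strictly linear chain of jobs would produce one job per wave, which is plain sequential execution.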
2, Metastore
The metastore holds the basic information about Hive databases. It lives in a relational database such as MySQL. It contains: database information, table names, the list of columns and partitions with their properties, table properties, and the directory where each table's data is located.
Third, managed tables and external tables
Hive tables are divided into two types: managed tables (internal tables) and external tables.
A managed table is used only within Hive, while an external table indicates data that is also used outside Hive. There are two differences between them:
3.1 Data storage
Managed table: the data is stored in the configured Hive warehouse directory. Mine is set to: /apps/hive/warehouse
External table: the data may reside in any HDFS directory.
3.2 Data deletion
Managed table: dropping the table removes both the metadata and the data.
External table: dropping the table removes only the metadata.
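The deletion difference can be modeled in a few lines. The sketch below is only a toy: the ToyWarehouse class, its fields, and the table names are invented for illustration and are not Hive APIs. A dict stands in for the metastore and a set of paths stands in for HDFS.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWarehouse:
    # Hypothetical stand-ins: table name -> (location, is_external)
    metastore: dict = field(default_factory=dict)
    # Directories that currently exist in the pretend file system
    hdfs: set = field(default_factory=set)

    def create_table(self, name, location, external=False):
        self.metastore[name] = (location, external)
        self.hdfs.add(location)

    def drop_table(self, name):
        # The metadata entry is always removed.
        location, external = self.metastore.pop(name)
        # The data directory is removed only for managed tables.
        if not external:
            self.hdfs.discard(location)


wh = ToyWarehouse()
wh.create_table("logs_managed", "/apps/hive/warehouse/logs_managed")
wh.create_table("logs_external", "/data/raw/logs", external=True)
wh.drop_table("logs_managed")
wh.drop_table("logs_external")
print(sorted(wh.hdfs))
```

After both drops the metastore is empty, but the external table's directory is still present, which is why external tables suit data that other tools also read.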
Fourth, partitions and buckets
4.1 Partition
A partition is in fact a subfolder under the table's folder in HDFS; it is not part of the table structure itself. Partitions help narrow the scope of a query and so improve efficiency. When data is imported, each row is placed into the designated partition, i.e. the specified folder, according to the value of a particular column.
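A minimal sketch of this folder layout follows. It assumes the column=value naming that Hive uses for partition directories and reuses the warehouse path mentioned earlier; the table name, column name, and helper functions are hypothetical, and an in-memory dict stands in for HDFS.

```python
from collections import defaultdict
from posixpath import join

WAREHOUSE = "/apps/hive/warehouse"  # the warehouse directory from section 3.1


def partition_dir(table, column, value):
    """Folder that holds the rows whose partition column equals value."""
    return join(WAREHOUSE, table, f"{column}={value}")


def load(rows, table, column):
    """Route each row into the folder chosen by its partition column."""
    folders = defaultdict(list)
    for row in rows:
        folders[partition_dir(table, column, row[column])].append(row)
    return dict(folders)


def query(folders, table, column, value):
    """Read a single partition folder instead of scanning the whole table."""
    return folders.get(partition_dir(table, column, value), [])


rows = [
    {"dt": "2020-01-01", "msg": "a"},
    {"dt": "2020-01-02", "msg": "b"},
    {"dt": "2020-01-01", "msg": "c"},
]
folders = load(rows, "logs", "dt")
print(sorted(folders))
print(query(folders, "logs", "dt", "2020-01-01"))
```

A query that filters on the partition column only has to open one folder, which is the narrowing of scope described above.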
4.2 Buckets
Bucketing is an extra structure attached on top of the table, and it can improve query efficiency. Each bucket corresponds to one MapReduce output file partition, and a job produces as many buckets as it has reduce tasks.
My personal understanding is that a table corresponds to a folder, a partition is a subfolder under that folder, and a bucket corresponds to a file inside the subfolder; content that shares the same characteristic is placed into the same file, namely a bucket.
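The idea that rows with the same characteristic end up in the same file can be sketched as a hash of the bucketing column taken modulo the number of buckets. The code below is a simplification under stated assumptions: it handles integer keys only and treats the integer itself as the hash, whereas Hive's real hash depends on the column type and version. The zero-padded file names only imitate the names commonly seen for bucket files, and the helper names are hypothetical.

```python
def bucket_index(key, num_buckets):
    """Assumed rule: an integer key hashes to itself, then modulo the bucket count."""
    return key % num_buckets


def bucket_file(key, num_buckets):
    """Illustrative file name for the bucket, e.g. 000001_0 for bucket 1."""
    return f"{bucket_index(key, num_buckets):06d}_0"


def bucketize(rows, column, num_buckets):
    """Place every row into the file chosen by its bucketing column."""
    files = {}
    for row in rows:
        files.setdefault(bucket_file(row[column], num_buckets), []).append(row)
    return files


users = [{"id": i} for i in range(1, 9)]
files = bucketize(users, "id", 4)
for name in sorted(files):
    print(name, [row["id"] for row in files[name]])
```

With four buckets, ids 1 and 5 share a file, as do 4 and 8, so a lookup by id only needs to read one of the four files.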