Hive Basic Structure and Data Storage

 

1. Introduction to Hive

 

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality by converting SQL statements into MapReduce jobs that analyze the data. This SQL dialect is referred to as HQL. Hive's advantage is its low learning cost: simple MapReduce statistics can be implemented quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in data warehouses.
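As a sketch of what this looks like in practice (the table and column names here are invented for illustration), a simple HQL aggregation that Hive would compile into a MapReduce job might be:

```sql
-- Hypothetical example: count rows per city.
-- Hive compiles this GROUP BY into a MapReduce job:
-- the map phase emits (city, 1) pairs and the reduce phase sums them.
SELECT city, COUNT(*) AS row_cnt
FROM orders
GROUP BY city;
```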

Hive stores its metadata in a relational database (RDBMS) such as MySQL or Derby. Hive has three modes for connecting to this metadata: single-user mode, multi-user mode, and remote service mode (also known as embedded mode, local mode, and remote mode).
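As a sketch of the multi-user (local) setup with a MySQL metastore, the relevant properties in hive-site.xml look roughly like the following (the host, database, user, and password values are placeholders):

```xml
<configuration>
  <!-- JDBC connection to the metastore database (placeholder host/db names) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```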

Hive Features:

1. Scalability

Hive can freely scale the size of its cluster, generally without restarting any services.

2. Extensibility

Hive supports user-defined functions (UDFs), so users can implement functions tailored to their own needs.

3. Fault tolerance

Hive has good fault tolerance; a SQL query can still complete even if some nodes have problems.

 

2. Hive Architecture

 

The Hive architecture is as follows:


The JobTracker shown in the architecture diagram is a Hadoop 1.x component; in Hadoop 2.x its role is filled by the ResourceManager plus the ApplicationMaster. Likewise, the TaskTracker corresponds to the NodeManager plus YarnChild.

As can be seen from the above figure, the Hive architecture is roughly divided into the following four parts:

1. User interfaces: CLI, Client, and WUI. The CLI, a shell command line, is the most commonly used; starting the CLI also starts a local copy of Hive. The Client is Hive's client, through which users connect to the Hive Server; when starting in Client mode, you must specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.

2. Metadata storage: usually kept in a relational database such as MySQL or Derby.

3. Interpreter, compiler, optimizer, and executor: these components carry an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The resulting query plan is stored in HDFS and later executed by MapReduce jobs.
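To inspect the plan these components produce, you can prefix a query with EXPLAIN (the wyp table name is the one used elsewhere in this article):

```sql
-- Prints the stages Hive will execute for this query;
-- each MapReduce job appears as a separate stage in the output.
EXPLAIN
SELECT COUNT(*) FROM wyp;
```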

4. Hadoop: data in Hive is stored in HDFS, and computation is performed with MapReduce.

 

3. Data storage

 

First, you need to know where Hive stores its data. Metadata (that is, the description of the data, including tables, table columns, and other attributes) is kept in a database such as MySQL, because it must be updated and modified constantly, which makes it unsuitable for storage in HDFS.

The actual table data is stored in HDFS, which is better suited to distributed computation over the data.

Hive mainly includes four types of data models:

1. Tables: tables in Hive are conceptually similar to tables in relational databases. Each table has a corresponding directory in HDFS that stores its data. This directory is configured through the hive.metastore.warehouse.dir property in the ${HIVE_HOME}/conf/hive-site.xml configuration file; its default value is /user/hive/warehouse (a directory on HDFS), and it can be changed as needed. If I have a table wyp, the directory /user/hive/warehouse/wyp will be created in HDFS (assuming hive.metastore.warehouse.dir is set to /user/hive/warehouse), and all data of the wyp table will be stored in that directory. External tables are the exception to this rule.
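A minimal sketch of creating the wyp table from the text (the column definitions and delimiter are assumed for illustration):

```sql
-- Creates the directory /user/hive/warehouse/wyp on HDFS
-- (assuming the default hive.metastore.warehouse.dir).
CREATE TABLE wyp (
  id   INT,
  name STRING,
  age  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
```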

2. External tables: an external table in Hive is very similar to a table, but its data is not stored in the table's own directory; it is stored elsewhere. The advantage is that when you drop an external table, the data it points to is not deleted; only the metadata corresponding to the external table is removed. When you drop a regular table, by contrast, all of the table's data, including the metadata, is deleted.
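A sketch of this behavior (the table name, columns, and LOCATION path are invented for illustration):

```sql
-- The data lives at the LOCATION path, not under the warehouse directory.
CREATE EXTERNAL TABLE wyp_ext (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/external/wyp';

-- DROP removes only the metadata; the files under /data/external/wyp remain.
DROP TABLE wyp_ext;
```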

3. Partitions: in Hive, each partition of a table corresponds to a subdirectory under the table's directory, and all data for a partition is stored in that subdirectory. For example, if the wyp table has two partition columns, dt and city, then the partition dt=20131218, city=BJ corresponds to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to that partition is stored there.
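A sketch of a partitioned table (the table name, columns, and input path are invented; the partition values match the example above):

```sql
-- Each (dt, city) value pair becomes a subdirectory of the table directory.
CREATE TABLE wyp_part (
  id   INT,
  name STRING
)
PARTITIONED BY (dt STRING, city STRING);

-- Loading into a partition creates
-- /user/hive/warehouse/wyp_part/dt=20131218/city=BJ on HDFS.
LOAD DATA LOCAL INPATH '/tmp/wyp_bj.txt'
INTO TABLE wyp_part PARTITION (dt = '20131218', city = 'BJ');
```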

4. Buckets: buckets divide the data by computing a hash of a specified column, and each bucket corresponds to one file (note the difference from partitions, which correspond to directories). The purpose is parallelism. For example, if the id column of the wyp table is distributed into 16 buckets, the hash of each id value is computed first; rows whose hash is 0 or 16 (that is, hash modulo 16 equals 0) are stored in the HDFS file /user/hive/warehouse/wyp/part-00000, and rows whose hash is 2 are stored in /user/hive/warehouse/wyp/part-00002.
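A sketch of a bucketed table (the table name and columns are invented; the setting shown applies to older Hive versions):

```sql
-- Rows are assigned to 16 files by hash(id) mod 16;
-- bucket 0 is written to part-00000, bucket 2 to part-00002, and so on.
CREATE TABLE wyp_bucketed (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 16 BUCKETS;

-- On older Hive versions this is needed so inserts
-- actually produce one file per bucket:
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE wyp_bucketed
SELECT id, name FROM wyp;
```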


Data storage reference: http://cloud.51cto.com/art/201507/484318.htm
