Getting Started with Hive, Hive vs SQL, and Hive's Architecture

Getting Started and Installation of Hive

In the previous article, we introduced MapReduce and implemented a WordCount demo with it. Although WordCount is a very simple function, it still required writing quite a lot of MapReduce code. That code can only be written by programmers who understand MapReduce, and it is difficult to maintain later. At this point, several problems become apparent:

1) When querying files in HDFS or tables in HBase, you have to write a lot of MapReduce code by hand just to implement a single query.

2) Analysis tasks can only be implemented by programmers who understand MapReduce, so the learning cost is relatively high.

3) It is time-consuming and labor-intensive; a great deal of effort is spent on boilerplate rather than put to effective use.

Hive was born to solve this. In essence it is still MapReduce: it is a higher-level application that wraps MapReduce so that far less MapReduce code has to be written. Most importantly, Hive is a data warehouse tool built on Hadoop, and in data warehousing SQL is the most commonly used analysis tool, so Hive is essentially a SQL analysis engine. In other words, we can write SQL directly to reach our goal quickly, and the rest is left to Hive, which translates the SQL statements into MapReduce jobs to satisfy our requirements, greatly saving time, cost, and effort.

Although Hive also uses SQL, many of its concepts are quite different from those of a real database. A table in Hive is purely a logical table; it is only the definition of the table, that is, the table's metadata, and physically it is just a directory or file in Hadoop.

Hive database tables in the Hadoop directory

The metadata of Hive tables is not kept locally but relies on MySQL, so you need to install MySQL on the Hive node and modify the Hive configuration file so that it points to the corresponding MySQL database.

By connecting a MySQL client to the database you configured, you can clearly see that once Hive is used, metadata about its tables is created there, including table fields, partitions, formats, parameters, and so on.
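As a rough illustration, you can inspect that metadata directly in MySQL. The sketch below assumes the default Hive metastore schema (tables such as DBS, TBLS, SDS, and COLUMNS_V2), which can vary between Hive versions; the table name my_table is hypothetical:

```sql
-- Run against the MySQL database that Hive is configured to use as its metastore.

-- List every Hive table together with the database it belongs to.
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;

-- Show the field (column) metadata recorded for one table.
SELECT c.COLUMN_NAME, c.TYPE_NAME
FROM COLUMNS_V2 c
JOIN SDS  s ON c.CD_ID = s.CD_ID
JOIN TBLS t ON s.SD_ID = t.SD_ID
WHERE t.TBL_NAME = 'my_table';   -- hypothetical table name
```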

Table properties

 

Field information for each table

From the screenshots above we can also see that Hive separates metadata from data storage, and that Hive can map a structured data file directly onto a database table.
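For example, a minimal sketch of such a mapping, assuming a tab-delimited file already sits in HDFS (the path, table name, and columns are hypothetical):

```sql
-- Map an existing structured file in HDFS onto a logical Hive table.
-- Only the table definition (metadata) goes into MySQL; the data stays in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS user_visits (
    user_id    BIGINT,
    page_url   STRING,
    visit_time STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/user_visits';

-- Dropping an EXTERNAL table removes the metadata but leaves the HDFS files untouched.
```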

Of course, a data warehouse tool like Hive also has a drawback: it does not support rewriting or deleting data, that is, it is meant for read-heavy, write-light workloads. Frequent modification or deletion operations are not recommended, because under the hood they involve processing massive amounts of data and would directly affect the performance of the whole cluster.
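On classic (non-ACID) Hive tables, the usual workaround is therefore to rewrite a whole table or partition instead of updating individual rows; a hedged sketch with hypothetical table and column names:

```sql
-- There is no row-level UPDATE/DELETE on classic Hive tables, so rewrite the
-- affected partition in one pass instead (table, columns, and partition are hypothetical).
INSERT OVERWRITE TABLE visit_logs PARTITION (dt = '2021-05-01')
SELECT user_id, page_url, visit_time
FROM   visit_logs_staging
WHERE  dt = '2021-05-01'
  AND  page_url IS NOT NULL;   -- e.g. drop bad rows while rewriting
```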

The figure below vividly illustrates the difference between implementing WordCount with Hive and writing native MapReduce; it is obvious that using Hive is far more efficient. Here the explode function expands a single row containing an array into multiple rows (one per element), while collect_list does the opposite, gathering values from multiple rows into a single array.
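For reference, WordCount expressed in Hive SQL might look roughly like the following sketch, where the table docs with a single STRING column line is hypothetical:

```sql
-- WordCount in Hive SQL: split each line into words, explode the array into rows,
-- then count occurrences per word.
SELECT word, COUNT(*) AS cnt
FROM (
    SELECT explode(split(line, '\\s+')) AS word
    FROM docs
) t
GROUP BY word
ORDER BY cnt DESC;
```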

 

The installation process of Hive is very simple; here are a few points to note during installation:

① Download apache-hive-1.2.2-bin.tar.gz, unzip it, and create a hive-site.xml configuration file in the conf directory.

② Configure environment variables: vim ~/.bashrc

③ Because Hive needs MySQL support, it needs the MySQL connector jar: copy mysql-connector-java-<version>-bin.jar into the lib directory under HIVE_HOME so that Hive can operate on MySQL. It is also best to create a dedicated Hive user in MySQL and grant it the corresponding privileges.
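A hedged sketch of that last step, assuming the metastore database is called hive_meta and the user is called hive (names and password are hypothetical):

```sql
-- Run in the MySQL client as an administrative user.
CREATE DATABASE IF NOT EXISTS hive_meta DEFAULT CHARACTER SET latin1;

CREATE USER 'hive'@'%' IDENTIFIED BY 'hive_password';
GRANT ALL PRIVILEGES ON hive_meta.* TO 'hive'@'%';
FLUSH PRIVILEGES;

-- hive-site.xml then points javax.jdo.option.ConnectionURL at
-- jdbc:mysql://<mysql-host>:3306/hive_meta, together with
-- javax.jdo.option.ConnectionUserName and ConnectionPassword.
```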

 

Hive SQL vs SQL

 

The differences between Hive SQL and SQL are shown in the figure above. First there is the question of data storage: the data behind SQL lives in the local file system, while the data behind HQL can be stored in HDFS or HBase. Second, HQL is extensible, meaning you can implement your own UDF, UDAF, and UDTF functions.

UDF - applied directly inside a SELECT statement; a common example is case conversion. It can be understood as a one-to-one relationship: one input row produces one output row.

UDAF - the many-to-one case, commonly seen in WordCount and in the GROUP BY stage.

UDTF - the one-to-many case: one input row produces multiple output rows.
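A short illustration of the three kinds of functions using Hive built-ins (the tables and columns below are hypothetical):

```sql
-- UDF: one row in, one row out, e.g. case conversion.
SELECT upper(name) FROM users;

-- UDAF: many rows in, one row out, e.g. the aggregation stage of WordCount.
SELECT word, COUNT(*) FROM words GROUP BY word;

-- UDTF: one row in, many rows out, e.g. exploding an array column into rows.
SELECT explode(tags) AS tag FROM articles;
```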

How to implement a UDF, UDAF, or UDTF by hand will be covered in a later article.

In addition, it is worth understanding the schema-on-read and schema-on-write approaches to checking data:

Schema on read means the data is checked only when Hive reads it: field parsing, schema interpretation, metadata lookup, locating the data in storage, and so on all happen at read time. Its biggest advantage is that loading data is very fast, because nothing has to be parsed during the write.

Schema on write, by contrast, optimizes for reads: checks such as indexing, compression, data consistency, and field validation are all carried out at write time, so writing is much slower.
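A hedged sketch of what schema on read looks like in Hive, reusing the hypothetical user_visits table from above: LOAD DATA simply moves the file into the table's directory without validating it, and any mismatch only shows up when the data is read.

```sql
-- Schema on read: loading just moves the file into the table's HDFS directory;
-- nothing is parsed or validated yet (path and table are hypothetical).
LOAD DATA INPATH '/tmp/raw_visits.txt' INTO TABLE user_visits;

-- Parsing happens here, at read time; fields that do not match the declared
-- column types simply come back as NULL instead of failing the load.
SELECT user_id, page_url FROM user_visits LIMIT 10;
```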

Following the comparison above, Hive can also be contrasted with traditional relational databases; the main differences are as follows:

1) The file systems used for storage are different: Hive uses Hadoop's HDFS, while a relational database uses the server's local file system.

2) The computing model Hive uses is MapReduce, while a relational database uses its own computing model.

3) Relational databases offer strong real-time performance, while Hive is designed for mining massive data sets and its real-time performance is poor.

4) Hive can easily scale both its storage capacity and its computing capacity, since it inherits these from Hadoop, while relational databases are weaker in this respect.

Although Hive SQL and SQL have their differences, each has its own application scenarios: SQL is suitable for real-time (coarse-grained) query business, while Hive SQL is aimed at data mining and data warehouses.

Hive Architecture

System Architecture Diagram 1

It mainly consists of three parts: the client, the Driver, and the MetaStore.

CLI: the command-line client, which lets you execute SQL interactively against Hive and talks directly to the Driver; it is the interface most of us use, similar to a CMD window.

JDBC/ODBC: Hive provides a JDBC driver. As a Java API, JDBC access goes through the Thrift Server, which then passes the request on to the Driver, just as the CLI tool does.

MetaStore: an independent relational database that stores the table schemas and other system metadata.

Driver: the driver module optimizes the computation required by the query and then executes it according to the planned steps (usually by launching a MapReduce job).

A client submits a task through the UI or a terminal window. The UI first talks to the Driver, which then invokes the Compiler. During compilation, the metadata must be looked up in the Metastore; if it is missing, an error is reported immediately. For example, if a query selects a column that does not exist in the table, Hive should directly report that the column does not exist. Looking up the metadata really means querying the data stored in MySQL. Once the required metadata has been retrieved from MySQL, the plan is handed to the Execution Engine, which carries out a series of operations. It does not execute the work itself; instead it submits tasks to the JobTracker, which manages resources and schedules tasks, that is, it distributes the submitted tasks to Map and Reduce nodes. The corresponding operator tree is generated inside the Map and Reduce tasks. Up to this point everything is pure logic, so data is still needed: it is fetched from HDFS by first contacting the NameNode to find the DataNodes that hold the data, and the data is then returned to the Map and Reduce tasks for execution (the execution flow is shown in the figure below).
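To see this compilation step in practice, you can ask Hive for the plan it generates before anything runs; a small sketch using the hypothetical words table from earlier:

```sql
-- EXPLAIN prints the stage plan produced by the Driver/Compiler; on classic
-- Hive-on-MapReduce, each stage typically corresponds to a MapReduce job.
EXPLAIN
SELECT word, COUNT(*) AS cnt
FROM   words
GROUP  BY word;
```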

 
