Big Data Technology Hive: Pilot Chapter (1)

Table of contents

1. What is Hive

2. Think about how to design Hive's functionality

2.1 Questions

2.2 Case analysis

2.3 Summary

3. Master Hive's architecture

3.1 Hive components - metadata storage

3.2 Hive components - Driver

3.3 Hive components - user interface


1. What is Hive

What is distributed SQL computing

We know that statistical analysis of data is usually done with a programming language (such as Java or Python) plus SQL, which shows that SQL is currently the most convenient programming tool for statistical data analysis.

Big data systems are full of statistical analysis scenarios, so using SQL to process data is also in great demand in big data.

But MapReduce, which we learned earlier and which is very important, only supports program development (in Java, Python, etc.); it does not support SQL development.

Therefore, although MapReduce is very important and computationally efficient, it is very complicated to use precisely because it does not support SQL development.

From this, Hive came into being.

What is Hive

Apache Hive is a distributed SQL computing tool. Its main function is:

Translate SQL statements into MapReduce programs and run them
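
For example, a short aggregation query like the hedged sketch below (the table and column names are illustrative) is something Hive compiles into one or more MapReduce jobs and runs on the cluster for you:

```sql
-- Illustrative HiveQL; the table t_user and its columns are assumptions.
-- Hive compiles this statement into MapReduce jobs and runs them.
SELECT city, COUNT(*) AS user_cnt
FROM t_user
GROUP BY city;
```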

Benefits of Hive

Problems faced when using Hadoop MapReduce to directly process data:

  • The learning cost for staff is too high: you need to master a programming language such as Java or Python
  • It is too difficult to develop complex query logic with MapReduce

Benefits of using Hive to process data

  • The operation interface uses SQL-like syntax to provide rapid development capabilities (simple and easy to use)
  • The bottom layer executes MapReduce, which can perform SQL processing over massive distributed data

2. Think about how to design Hive's functionality

2.1 Questions

If you were asked to design the Hive software, the requirements would be:

  1. Users only write SQL statements
  2. Hive automatically converts the SQL into a MapReduce program and submits it to run
  3. Process structured data located on HDFS.

How would you achieve this?

2.2 Case analysis

Consider the SQL: SELECT city, COUNT(*) FROM t_user GROUP BY city;

If we translate it into a MapReduce program, the following questions arise:

  • Where are the data files?
  • What symbol is used as column separator?
  • Which column should be used as city?
  • What type of data is the city column?

Let’s analyze them one by one:

Where are the data files?

Based on this "given SQL" alone, how do you know where the data files are located?

On this point we might as well borrow from a database (such as MySQL), which can locate where the data files are stored from the SQL statement alone.

Similarly, the questions of what symbol is used as the column separator, which column is city, and what type of data the city column holds are all solved in MySQL through internal mapping relationships.

So the simplest way is to find a database and let it manage this descriptive information about our data. We call this metadata management.

Metadata management

Therefore, the functions of metadata management are:

Solve problems such as data location and data structure by describing and recording the data.
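
To see how such metadata gets recorded in practice, here is a hedged Hive DDL sketch (the table name, delimiter, and path are illustrative assumptions); each clause captures one of the questions raised above:

```sql
-- Illustrative DDL; all names, the delimiter, and the path are assumptions.
CREATE TABLE t_user (
    id   INT,        -- column types: what type of data is each column?
    name STRING,
    city STRING      -- column list: which column should be used as city?
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','                  -- what symbol is the column separator?
LOCATION '/user/hive/warehouse/t_user';   -- where are the data files on HDFS?
```

Once a statement like this has been recorded, any later query on t_user can be answered against this stored description.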

SQL parser

With metadata management solved, one crucial step remains: completing the SQL-to-MapReduce conversion.

We call this feature the SQL parser, and we expect it to:

  • SQL analysis
  • SQL to MapReduce program conversion
  • Submit the MapReduce program to run and collect execution results
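
In a real Hive installation, a hedged way to peek at this parser's output is the EXPLAIN statement, which prints the generated plan (including its MapReduce stages) without running the query; the exact output varies by Hive version:

```sql
-- Prints the query plan Hive generates, without executing the query.
EXPLAIN
SELECT city, COUNT(*) FROM t_user GROUP BY city;
```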

Note: the SQL parsing mentioned here is not the same thing as the SQL parser of database tools such as MySQL. MySQL's SQL parser is an internal component for parsing SQL syntax, while the SQL parser here is designed for the Hive tool: its purpose is to parse Hive SQL, analyze it, and convert it into a MapReduce program, because Hive's SQL differs from a database's SQL.

As for the similarities and differences between Hive's SQL and a database's SQL, and how to make MySQL recognize Hive's SQL syntax, I will talk about that later ~

Therefore, once we have the parser, the basic construction of a MapReduce-based distributed SQL execution engine is complete.

2.3 Summary

The two main components of Apache Hive are the SQL parser and the metadata store.

3. Master Hive's architecture

3.1 Hive components - metadata storage

Metadata is usually stored in a relational database such as MySQL or Derby. Hive's metadata includes table names, each table's columns and partitions and their attributes, the attributes of the table itself (whether it is an external table, etc.), the directory where the table's data is located, and so on.

-- Hive provides the Metastore service process to supply metadata management functionality
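
As a hedged illustration, assuming a MySQL-backed metastore using Hive's standard schema, you could peek at the recorded metadata directly in MySQL (TBLS and DBS are tables of the standard metastore schema):

```sql
-- Run against the metastore database in MySQL, not through Hive itself.
-- Lists each Hive table, its type, and the database it belongs to.
SELECT t.TBL_NAME, t.TBL_TYPE, d.NAME AS db_name
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;
```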

3.2 Hive components - Driver

This is the SQL parser, which includes the syntax parser, plan compiler, optimizer, and executor.

Role

It takes an HQL query through lexical analysis, syntax analysis, compilation, optimization, and query plan generation.

The generated query plan is stored in HDFS and is subsequently executed by calls from the execution engine.

This part is not a separate service process; it is Java code encapsulated in the JAR files that Hive depends on.
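
To watch what this driver produces, the hedged EXPLAIN sketch from section 2.2 works here as well; adding the EXTENDED keyword prints more detail, such as data paths resolved through the metastore (output varies by Hive version):

```sql
-- EXPLAIN EXTENDED prints a more detailed plan, including file paths
-- that the compiler resolved through the metastore.
EXPLAIN EXTENDED
SELECT city, COUNT(*) FROM t_user GROUP BY city;
```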

3.3 Hive components - user interface

These include the CLI, JDBC/ODBC, and the WebGUI. The CLI (command line interface) is the shell command line; Hive's Thrift server allows external clients to interact with Hive over the network, similar to the JDBC or ODBC protocols; and the WebGUI accesses Hive through a browser.

-- Hive provides Hive Shell, ThriftServer, and other service processes to give users operation interfaces

The next chapter will explain the installation and deployment of Apache Hive, along with a hello world example.
