No hand-written MapReduce needed: Hive can automatically process SQL on the big data platform

Abstract: Is there an easier way to run SQL directly on the big data platform?

This article is shared from Huawei Cloud Community " Hive Execution Principle ", author: JavaEdge.

MapReduce lowered the difficulty of big data programming, so that big data computing was no longer an unreachable technical temple and ordinary engineers could develop big data programs. However, for people who frequently need to perform big data calculations, such as data analysts doing business intelligence (BI) work, SQL is the usual tool for analysis and statistics, and MapReduce programming still presents a real threshold. Moreover, developing a dedicated MapReduce program for every statistic or analysis would be far too costly.

Is there an easier way to run SQL directly on the big data platform?

Let's first look at how to implement SQL data analysis with MapReduce.

How MapReduce implements SQL

How would a common SQL analysis statement be implemented as a MapReduce program?

 SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;

This statistical analysis statement counts the interests and preferences of users of different ages when visiting different web pages. The specific input data and execution results:

  • On the left, the data table to be analyzed
  • On the right, the analysis results

Summing identical rows in the left table produces the right table, much like a WordCount calculation. According to the MapReduce programming model, the SQL's calculation process is:

  • The input <K, V> of the map function: here only V matters

V is one row of the left table, such as <1, 25>

  • The map function outputs the input V as the new K, with V uniformly set to 1

For example, <<1, 25>, 1>

After the map output is shuffled, identical keys and their corresponding values are grouped into a <K, V-set> and sent to the reduce function as input. For example, <<2, 25>, 1> is emitted twice by the map function, so by the time it reaches reduce it has become the input <<2, 25>, <1, 1>>, where K is <2, 25> and the V-set is <1, 1>.

Inside the reduce function, all the numbers in the V-set are summed and output, so the output of reduce is <<2, 25>, 2>.

In this way, an SQL statement is computed by MapReduce.
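The whole map–shuffle–reduce flow described above can be simulated in a few lines. This is an illustrative sketch, not Hive or Hadoop code; the sample rows are hypothetical:

```python
from collections import defaultdict

# Input rows of pv_users: (pageid, age)
pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]

def map_fn(row):
    # Emit the whole row as the key and a count of 1: <<pageid, age>, 1>
    return (row, 1)

def shuffle(mapped):
    # Group identical keys together: <K, [V, V, ...]>
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    return groups

def reduce_fn(key, values):
    # Sum the 1s to get the count for each (pageid, age) pair
    return (key, sum(values))

mapped = [map_fn(row) for row in pv_users]
results = [reduce_fn(k, vs) for k, vs in shuffle(mapped).items()]
# <2, 25> was emitted twice by map, so reduce outputs <<2, 25>, 2>
```

Running this yields exactly the GROUP BY result: each distinct (pageid, age) pair with its count.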

In the data warehouse, SQL is the most commonly used analysis tool. Since an SQL statement can be implemented as a MapReduce program, is there a tool that can automatically generate the MapReduce code from SQL? With such a tool, data analysts would only need to write SQL, which would be compiled into MapReduce executable code and submitted to Hadoop for execution. That tool is Hive, the Hadoop big data warehouse.

Hive Architecture

Hive can directly process the SQL we input (Hive SQL syntax differs slightly from standard database SQL) and call the MapReduce computing framework to complete the data analysis.

SQL commands are submitted to Hive through a Hive client (Hive's command-line tool, JDBC, etc.):

  • If it is DDL, Hive records the table information in the Metastore metadata component through the execution engine (Driver). The Metastore is usually implemented with a relational database and stores metadata such as table names, field names, field types, and the associated HDFS file paths.
  • If it is DQL, the Driver hands the statement to its compiler, which performs syntax analysis, semantic parsing, optimization, and other steps, and finally generates a MapReduce execution plan. A MapReduce job is then generated from the execution plan and submitted to the Hadoop MapReduce computing framework for processing.

For a simple SQL command:

 SELECT * FROM status_updates WHERE status LIKE 'michael jackson';

Its corresponding Hive execution plan:

Hive has many built-in operators and functions. The execution plan arranges these operators into a DAG (directed acyclic graph) according to the SQL statement and then encapsulates it into MapReduce's map and reduce functions. In this case, the map function chains three Hive operators, TableScanOperator, FilterOperator, and FileOutputOperator, to complete the computation entirely on the map side, with no reduce function needed.
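The map-side operator chain for this query can be sketched as a pipeline of generators. The function names mirror Hive's operators, but the implementations here are purely illustrative, and the sample records are made up:

```python
def table_scan(rows):
    # TableScanOperator: read each record from the input table
    for row in rows:
        yield row

def filter_op(rows, predicate):
    # FilterOperator: keep only rows matching the WHERE condition
    for row in rows:
        if predicate(row):
            yield row

def file_output(rows):
    # FileOutputOperator: write surviving rows to the output "file"
    return list(rows)

status_updates = [
    {"userid": 1, "status": "listening to michael jackson"},
    {"userid": 2, "status": "at work"},
]

# WHERE status LIKE '%michael jackson%', approximated as a substring test
out = file_output(filter_op(table_scan(status_updates),
                            lambda r: "michael jackson" in r["status"]))
```

Each operator consumes the previous one's output, which is exactly the DAG-as-pipeline structure the execution plan encodes.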

How Hive implements the join operation

In addition to simple aggregation (GROUP BY) and filtering (WHERE), Hive can also perform joins (JOIN ... ON).

In practice, the pv_users table cannot be obtained directly, because the pageid data comes from the user access log: each time a user views a page, an access record is generated and saved in the page_view table, while the age information is recorded in the user table.

Both tables share the userid field, on which they can be joined to generate the pv_users table from the previous example:

 SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);

This SQL command can also be converted into a MapReduce calculation. The join proceeds as follows:

The MapReduce calculation process for a join differs slightly from the earlier GROUP BY. Because a join involves two tables coming from two files (or folders), the map output must be tagged with its source table: a value from the first table is recorded as <1, X>, where 1 means the data comes from the first table. After shuffling, records with the same key arrive at the same reduce function, where the table tags allow the Cartesian product of the values to be computed: each record of the first table is combined with each record of the second table, and the output is the join result.

So if you open the Hive source code and look at the join-related code, you will see a two-layer for loop that joins the records from the two tables.
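The reduce-side join described above, with tagged map output and a two-layer for loop in reduce, can be sketched as follows. The sample rows are hypothetical and the tags 1 and 2 mark the source table, as in the text:

```python
from collections import defaultdict

page_view = [(1, 1), (2, 1), (1, 2)]   # (pageid, userid)
user      = [(1, 25), (2, 32)]         # (userid, age)

# Map phase: key on userid, tag each value with its source table
mapped = []
for pageid, userid in page_view:
    mapped.append((userid, (1, pageid)))  # tag 1: from page_view
for userid, age in user:
    mapped.append((userid, (2, age)))     # tag 2: from user

# Shuffle phase: group values by key
groups = defaultdict(list)
for k, v in mapped:
    groups[k].append(v)

# Reduce phase: Cartesian product of the two tagged groups per key
pv_users = []
for userid, values in groups.items():
    pageids = [v for tag, v in values if tag == 1]
    ages    = [v for tag, v in values if tag == 2]
    # the two-layer for loop: every left record joins every right record
    for pageid in pageids:
        for age in ages:
            pv_users.append((pageid, age))
```

The result is the pv_users table of (pageid, age) pairs, ready for the GROUP BY computation shown earlier.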

Summary

Developers rarely need to write MapReduce programs by hand, because the bulk of a website's big data processing is SQL analysis; this is why Hive is so important in big data applications.

With the popularity of Hive, we have a stronger demand for executing SQL on Hadoop, and the application scenarios of big data SQL have also diversified, so we have developed various big data SQL engines.

Cloudera developed Impala, an MPP-based SQL engine that runs on HDFS. Unlike MapReduce, which starts Map and Reduce processes and splits the computation into two stages, Impala deploys an identical Impalad process on every DataNode server, and these Impalad processes cooperate to complete the SQL computation. In some statistical scenarios, Impala can reach millisecond-level computing speed.

Later, Spark was born and launched its own SQL engine, Shark (which later evolved into Spark SQL), parsing SQL statements into Spark execution plans and running them on Spark. Since Spark is much faster than MapReduce, Spark SQL is also much faster than Hive, and as Spark grew popular, Spark SQL was gradually accepted. Hive then launched Hive on Spark, which converts Hive's execution plans into Spark's computing model.

We also hope to execute SQL on NoSQL: after all, SQL has been developed for decades, has accumulated a huge user base, and many people are used to solving problems with it. So Salesforce introduced Phoenix, an SQL engine that runs on HBase.

These SQL engines support only SQL-like syntax and cannot support standard SQL the way databases do. In particular, in the data warehouse field, nested-query SQL (a SELECT subquery nested inside a WHERE condition) is almost unavoidable, yet almost no big data SQL engine supported it. Users accustomed to traditional databases, however, hope big data platforms can also support standard SQL.

Back to Hive. There is no technical innovation in Hive's own architecture; database-related techniques and architectures were already very mature, and applying them to MapReduce yields Hive, the Hadoop big data warehouse. Yet the idea of grafting the two technologies together was highly innovative: the Hive produced by this graft greatly lowered the threshold for big data applications and helped make Hadoop popular.

