"Offline and real-time big data development combat" (4) Hive principle practice

Preface

We all know that Hive SQL is actually translated into MapReduce for execution, so what is the specific process? Today we will explore the execution mechanism and principles behind Hive SQL.

A deeper understanding of how Hive SQL executes is very important for developing and optimizing everyday offline tasks, and it directly affects the execution efficiency and running time of Hive SQL.

One, Hive basic architecture

Hive is a mainstream data warehouse solution built on Hadoop. Hive SQL is its main interaction interface, the actual data is stored in HDFS files, and the actual computation and execution are done by MapReduce; the bridge between them is the Hive engine.

Hive engine architecture
The main components of Hive include the UI components, the Driver component (Compiler, Optimizer, and Executor), the Metastore component, the CLI (Command Line Interface), JDBC/ODBC, the Thrift Server, and the Hive Web Interface (HWI).

Hive receives Hive SQL queries through the CLI, JDBC/ODBC, or HWI, compiles, analyzes, and optimizes them with the Driver component, and finally turns them into executable MapReduce jobs.

The execution process of Hive main components

Two, Hive SQL

Hive SQL is a SQL dialect similar to the ANSI SQL standard, but the two are not exactly the same. Hive SQL is closest to MySQL's SQL dialect, yet there are also significant differences between the two. For example, Hive does not support row-level insert, update, and delete, nor does it support transactions.

Hive key concepts

1. Hive database

The database in Hive is essentially just a directory or namespace, but for clusters with many users and groups this concept is very useful.

First, it avoids table naming conflicts; second, it corresponds to the database concept in relational databases, namely a set of tables or a logical group of tables, which is easy to understand.
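
As a small illustration (the database name and comment are hypothetical), creating and switching to a database looks like this:

CREATE DATABASE IF NOT EXISTS analytics
COMMENT 'tables for the analytics team';

USE analytics;
-- by default this database corresponds to the directory <warehouse_dir>/analytics.db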

2. Hive table

Tables in Hive are conceptually similar to tables in a relational database, and each table has a corresponding directory in which Hive stores its data. If no database is specified for the table, Hive uses the default value of the hive.metastore.warehouse.dir property in the ${HIVE_HOME}/conf/hive-site.xml configuration file (usually /user/hive/warehouse; it can be modified according to the actual situation), and all table data (excluding external tables) is stored in this directory.

Hive tables fall into two categories: internal tables and external tables. An internal table (managed table) is a table managed by Hive, both in the logical and syntactic sense and in the physical sense: when an internal table is created, its data is actually stored in the table's directory, and when the internal table is dropped, the physical data and files are deleted as well.

So whether to choose an internal table or an external table?

In most cases, the difference between the two is not obvious. If all data processing is performed in Hive, then internal tables are preferred. But if Hive and other tools deal with the same data set, then external tables are more suitable.

  • A common pattern is to use an external table to access the initial data stored in HDFS (usually created by other tools), then use Hive to transform the data and place the result in an internal table, as sketched after this list. Conversely, an external table can also be used to expose Hive's processing results to other applications.

  • Another scenario for using external tables is to associate multiple schemas for a data set.
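
A minimal sketch of this pattern, assuming raw log files already sit under a hypothetical HDFS path /data/raw/logs:

-- external table over existing HDFS data; dropping it leaves the files in place
CREATE EXTERNAL TABLE raw_logs (ts BIGINT, line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/logs';

-- internal (managed) table holding the transformed result
CREATE TABLE clean_logs AS
SELECT ts, lower(line) AS line
FROM raw_logs
WHERE line IS NOT NULL;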

3. Partitions and buckets

Hive organizes tables into partitions, dividing a table according to the values of partition columns; partitioning makes queries over a slice of the data faster. Tables or partitions can be further subdivided into buckets, which add extra structure to the data that can be exploited for efficient queries.

For example, user ID-based bucketing can make user-based queries very fast.

(1) Partition

Assume that in the log data, each record has a timestamp. If you partition according to time, the data of the same day will be divided into the same partition.

Partitioning can be done in multiple dimensions. For example, after dividing by date, it can be further divided according to country.

Example of the physical structure corresponding to the Hive partition
Partitions are defined with the PARTITIONED BY clause when creating a table; this clause takes a list of fields:

CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

When importing data into a partition table, the value of the partition is explicitly specified:

LOAD DATA INPATH '/user/root/path'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

In actual SQL, specifying the partition flexibly greatly improves efficiency; the following query only scans the directory for dt='2001-01-01' and country='GB'.

SELECT ts, dt, line FROM logs WHERE dt='2001-01-01' AND country='GB';

(2) Buckets

There are usually two reasons for using buckets in tables or partitions:

  • One is efficient queries. Bucketing adds extra structure to the table, and Hive can use this structure to improve efficiency when querying. For example, if two tables are bucketed on the same field, a map-side join can be used when joining them, provided the join field is the bucketing field.
  • The other is efficient sampling. When analyzing large data sets, it is often necessary to observe and analyze a sampled subset of the data, and bucketing makes efficient sampling possible.

Bucketing of a Hive table is specified with the CLUSTERED BY clause when creating the table:

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

The statement specifies that the table is bucketed by the id field into 4 buckets. When bucketing, Hive determines which bucket a row belongs to by hashing the bucketing field and taking the remainder modulo the number of buckets, so each bucket is a random sample of the overall data.

In a map-side join where the two tables are bucketed on the same field, when processing a bucket of the left table, the data can be read directly from the corresponding bucket of the right table for the join. The two tables in a map-side join do not necessarily need exactly the same number of buckets, as long as one is a multiple of the other.

It should be noted that Hive does not verify that the data actually conforms to the bucketing in the table definition, and an error is reported only when an exception occurs during a query. Therefore, a better approach is to let Hive do the bucketing work (set the hive.enforce.bucketing property to true).
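
A minimal sketch of populating and sampling the bucketed table above, assuming a hypothetical source table raw_users with the same columns:

-- let Hive produce the declared number of buckets on insert
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_users
SELECT id, name FROM raw_users;

-- efficient sampling: read only 1 of the 4 buckets
SELECT * FROM bucketed_users TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);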

Hive DDL

1. Create a table

  • CREATE TABLE: creates a table with the specified name. If a table with the same name already exists, the user can use the IF NOT EXISTS option to ignore the exception.
  • EXTERNAL: this keyword lets users create an external table and specify the path to the actual data (LOCATION) while creating the table.
  • COMMENT: adds descriptions to tables and fields.
  • ROW FORMAT: users can customize the SerDe or use a built-in SerDe when creating a table.
  • STORED AS: if the file data is plain text, use STORED AS TEXTFILE; if the data needs to be compressed, use STORED AS SEQUENCEFILE.
  • LIKE: lets users copy an existing table structure without copying the data.

hive> CREATE TABLE empty_key_value_store
LIKE key_value_store;

You can also create a table by CREATE TABLE AS SELECT, an example is as follows:

hive> CREATE TABLE new_key_value_store
	ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
	STORED AS RCFile
	AS
	SELECT (key % 1024) new_key, concat(key, value) key_value_pair
	FROM key_value_store
	SORT BY new_key, key_value_pair;
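
Putting several of the options above together, here is a hedged sketch of an external, plain-text table definition (the table name, columns, delimiter, and HDFS path are all hypothetical):

CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  ts BIGINT COMMENT 'event timestamp',
  url STRING COMMENT 'page url'
)
COMMENT 'raw page view events produced by an external tool'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/external/page_views';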

2. Modify the table

The syntax for modifying the table name is as follows:

hive> ALTER TABLE old_table_name RENAME TO new_table_name;

The syntax for modifying column names is as follows:

ALTER TABLE table_name CHANGE [COLUMN] old_col_name new_col_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name]
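
For example (hypothetical column names), renaming a column, changing its comment, and moving it to the front:

hive> ALTER TABLE pokes CHANGE COLUMN old_col new_col INT COMMENT 'renamed column' FIRST;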

The above syntax allows you to change the column name, data type, comment, column position, and any combination of them. If you want to add a new column after building a table, use the following syntax:

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT COMMENT 'new col comment');

3. Delete table

The DROP TABLE statement deletes the table's data and metadata. For external tables, only the metadata in the Metastore is deleted; the external data itself is not removed. An example is as follows:

drop table my_table;

If you only want to delete table data and retain the table structure, similar to MySQL, use the TRUNCATE statement:

TRUNCATE TABLE my_table;

4. Insert Table

(1) Load data into the table

An example using a relative path is as follows:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

(2) Insert the query result into Hive

Query results can be inserted into Hive tables, i.e., written back to the HDFS file system:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;

This is the basic mode, and there are multiple insertion modes and automatic partitioning modes, which are not described here.
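
For instance, a minimal sketch of the basic mode that loads one day of data into a partition (the table and column names are hypothetical):

INSERT OVERWRITE TABLE daily_orders PARTITION (dt='2017-01-01')
SELECT order_id, city
FROM orders_staging
WHERE dt='2017-01-01';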

Hive DML

1. Basic select operation

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list [HAVING condition]] 
[ CLUSTER BY col_list 
| [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list] 
]
[LIMIT number]
  • The ALL and DISTINCT options control how duplicate records are treated. The default is ALL, which returns all records; DISTINCT removes duplicate records.
  • WHERE condition: similar to the WHERE condition in traditional SQL; supports AND, OR, BETWEEN, IN, NOT IN, etc.
  • The difference between ORDER BY and SORT BY: ORDER BY performs a global sort and uses only one Reduce task, while SORT BY only sorts locally within each reducer (see the sketch after this list).
  • LIMIT: limits the number of records returned, such as SELECT * FROM t1 LIMIT 5. Top-k queries can also be implemented; for example, the following statement returns the 5 sales records with the largest amount

SET mapred.reduce.tasks = 1;
SELECT * FROM test SORT BY amount DESC LIMIT 5;

  • REGEX Column Specification: the SELECT statement can use regular expressions to select columns. The following statement queries all columns except ds and hr

SELECT `(ds|hr)?+.+` FROM test;
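
As referenced in the list above, a small sketch contrasting global and per-reducer sorting (the sales table and its columns are hypothetical):

-- global ordering: all rows pass through a single reducer
SELECT user_id, amount FROM sales ORDER BY amount DESC;

-- per-reducer ordering: rows with the same user_id go to the same reducer,
-- and each reducer sorts its own portion by amount
SELECT user_id, amount FROM sales DISTRIBUTE BY user_id SORT BY amount DESC;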

2. JOIN tables

join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | (table_references)

join_condition:
    ON expression
  • Hive only supports equal joins, outer joins and left semi joins (non-equivalent joins are supported since version 2.2.0);
  • You can join more than two tables, for example:
select a.val, b.val,c.val 
from a 
join b 
on (a.key=b.key1) 
join c 
on(c.key = b.key2);
  • If multiple tables in the join use the same join key, the join is converted into a single Map/Reduce job
select a.val,b.val,c.val 
from a 
join b 
on (a.key=b.key1) 
join c 
on(c.key=b.key1);
  • Put the largest table last in the join: in the Reduce phase, the records of all tables except the last one in the join sequence are buffered in memory, and the join result is then produced by streaming the last table through, so placing the largest table last reduces memory usage
  • If you want to limit the output of join, you should write the filter condition in the where clause, or write it in the join clause.
  • However, this interacts with partitioned tables. In the first SQL statement below, if no matching record is found in table d for a record of table c, all columns of d are NULL in the output, including the ds column, so the WHERE condition on d.ds filters out exactly those rows. In other words, the WHERE clause removes all records of c that have no match in d on the join key, and the LEFT OUTER semantics no longer affect the result. The solution is to specify the partition condition in the join's ON clause (as shown in the second SQL statement below)
-- first SQL statement
SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key=d.key)
WHERE c.ds='2009-07-07' AND d.ds='2009-07-07'
-- second SQL statement
SELECT c.val, d.val FROM c LEFT OUTER JOIN d
ON (c.key=d.key AND d.ds='2009-07-07' AND c.ds='2009-07-07')
  • LEFT SEMI JOIN is a more efficient implementation of IN/EXISTS subqueries. Its restriction is that the right-hand table may only appear as a filter condition in the ON clause; it cannot be referenced in the WHERE clause, the SELECT clause, or anywhere else.
SELECT a.key, a.value
FROM a
WHERE a.key IN
  (SELECT b.key FROM b);

-- can be rewritten as:
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key);

Three, Hive SQL execution principle diagram

We all know that a well-written Hive SQL statement and a poorly written one may differ by a factor of hundreds, thousands, or even tens of thousands in the underlying computation and resources they consume.

Besides wasting resources, improperly used Hive SQL may run for several hours, or even more than ten hours, without producing a result. Therefore, it is very necessary to deeply understand the execution process and principles of Hive SQL.

Take the group by statement execution diagram as an example:

We assume a business background: analyze the distribution of iPhone 7 customers in each city, that is, which city buys the most and which is the least.

select city, count(order_id) as iphone7_count
from orders_table
where day='20170101' and cat_name='iphone7'
group by city;

The underlying MapReduce execution process:

Hive group by statement execution principle diagram
The group by statement in Hive SQL involves redistributing and shuffling data, so its execution covers the full execution process of a MapReduce task.

(1) Input splits

The input of the group by statement is still the partition file for day=20170101; the input splitting process and the number of splits are the same as for the select statement, and the input is likewise divided into three splits of 128 MB, 128 MB, and 44 MB.

(2) Map stage

The Hadoop cluster also starts three Map tasks to process the corresponding three split files. Each Map task processes every line in its split file and checks whether the product category is iPhone 7; if it is, it outputs a <city, 1> key-value pair, because the number of orders needs to be counted per city (note the difference from the select statement).

(3) Combiner stage

  • The Combiner stage is optional. If a Combiner operation is specified, Hadoop executes it on the local output of each Map task. The advantage is that redundant output is removed, avoiding unnecessary downstream processing and network transfer overhead.
  • In this example, <hz,1> appears twice in the output of Map Task 1, so the Combiner operation can merge them into <hz,2>.
  • Using a Combiner has risks. The rule for using it is that the Combiner's output must not affect the final result of the Reduce computation. For example, if the computation only finds a total, maximum, or minimum, a Combiner can be used; but if a Combiner is used for an average computation, the final Reduce result will be wrong.

(4) Shuffle stage

A complete shuffle includes partitioning, sorting, spilling, copying, merging, and other steps.

  • For understanding the group by statement, there are two key processes: partitioning and merging. Partitioning is how Hadoop decides which Reduce task each output key-value pair of each Map task is assigned to; merging is how, inside a Reduce task, values with the same key coming from multiple Map tasks are combined.
  • The most commonly used partitioning method in Hadoop is the hash partitioner: Hadoop computes a hash value for each key and takes it modulo the number of Reduce tasks to obtain the target Reduce task. This guarantees that identical keys are always assigned to the same Reduce task, and the hash function also helps distribute the Map output evenly across all Reduce tasks.

(5) Reduce phase

The reduce function is called for each key group, and each Reduce task writes its output to its own output file.

(6) Output file

Hadoop collects the output files of all Reduce tasks into the job's output directory.

Four, summary

We have introduced the execution principles of Hive SQL. Of course, you should know not only the what but also the why: understanding how Hive executes is the prerequisite and foundation for writing efficient SQL, and it is also the basis for mastering Hive SQL optimization techniques. Next, we will move on to Hive optimization practice.

Origin blog.csdn.net/BeiisBei/article/details/109005996