The connection between MapReduce and Hive

map principle diagram:

 

Process:

Block data: dear car bear car

Map: split each line on spaces and emit key-value pairs, where the key is a word produced by the split and the value is 1.

Map output: (dear,1),(car,1),(bear,1),(car,1)

Shuffle: send all pairs with the same key to the same reduce task.

For example, if four (Dear, 1) pairs arrive from all the map tasks, they are converted to [Dear, Iterable(1, 1, 1, 1)] and passed to reduce() as its two parameters.

Inside reduce(), the values are summed to get a total of 4 for Dear, and (Dear, 4) is output as a key-value pair.
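
The map, shuffle, and reduce steps above can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code; the function names map_phase, shuffle_phase, and reduce_phase are made up for illustration.

```python
from collections import defaultdict

def map_phase(block):
    # Split the block on whitespace and emit a (word, 1) pair per word.
    return [(word, 1) for word in block.split()]

def shuffle_phase(pairs):
    # Group values by key, like routing identical keys to one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the 1s collected for this key.
    return (key, sum(values))

block = "dear car bear car"
mapped = map_phase(block)
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'dear': 1, 'car': 2, 'bear': 1}
```

With only this one block as input, dear is counted once; a total such as (Dear, 4) would come from the same word appearing in other blocks as well.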

Source: https://blog.csdn.net/ych0112xzz/article/details/81186204

A join in Hive can be either a Common Join (the join happens in the Reduce phase) or a Map Join (the join happens in the Map phase).

Hive Common Join

Hive SQL is translated into MapReduce jobs that run on YARN.

A Common Join goes through the Map, Shuffle, and Reduce stages:

Map stage

  • Read the source tables and use the column in the JOIN ON condition as the Map output key. If the join has several join keys, their combination is used as the key.
  • The Map output value holds the columns needed after the join (those used in SELECT or WHERE), plus a tag that identifies which table the value came from.
  • Sort by key.

Shuffle stage

Hash the key and send each key/value pair to a reducer chosen by the hash value, so that rows with the same key from both tables arrive at the same reducer.
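
As a rough illustration of this routing, the sketch below picks a reducer as hash(key) modulo the number of reducers, which is the same idea as Hadoop's default HashPartitioner. The partition function and the sample records are hypothetical, and zlib.crc32 is used only as a stand-in for a stable hash, because Python's built-in hash() for strings changes between processes.

```python
import zlib

def partition(key, num_reducers):
    # Stable hash of the key, reduced to a reducer index in [0, num_reducers).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical map output: (join key, tagged row) from tables a and b.
records = [("1", "row from a"), ("1", "row from b"), ("2", "row from a")]
assignments = [(key, partition(key, 3)) for key, _ in records]

# Both rows with key "1" get the same reducer index.
print(assignments)
```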

Reduce stage

The join is completed per key; the tag on each value identifies which table the row came from.
Take the following HQL as an example to illustrate the process:

SELECT
a.id, a.dept, b.age
FROM a JOIN b
ON (a.id = b.id);
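
A single-process Python sketch of how this query could flow through the Map, Shuffle, and Reduce stages of a Common Join. The rows of table a are invented for illustration, and the helper names map_table, shuffle, and reduce_join are hypothetical; real Hive runs these steps as distributed tasks.

```python
from collections import defaultdict

table_a = [(1, "sales"), (2, "hr"), (3, "ops")]  # hypothetical (id, dept) rows
table_b = [(1, 26), (2, 34)]                     # (id, age) rows

def map_table(rows, tag):
    # Key = join column; value = (table tag, column needed after the join).
    return [(row_id, (tag, col)) for row_id, col in rows]

def shuffle(pairs):
    # Sort by key and group, so one key's values from both tables meet.
    groups = defaultdict(list)
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        groups[key].append(value)
    return groups

def reduce_join(key, values):
    # Use the tag to separate the two tables, then pair their rows.
    depts = [v for tag, v in values if tag == "a"]
    ages = [v for tag, v in values if tag == "b"]
    return [(key, dept, age) for dept in depts for age in ages]

pairs = map_table(table_a, "a") + map_table(table_b, "b")
result = []
for key, values in shuffle(pairs).items():
    result.extend(reduce_join(key, values))
print(result)  # [(1, 'sales', 26), (2, 'hr', 34)]
```

The row with id 3 has no match in b, so the inner join drops it.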


Hive Map Join

Assuming that table a is a large table, b is a small table, and hive.auto.convert.join=true, Hive automatically converts the join to a MapJoin at execution time.

Reason: a MapJoin avoids transferring a large amount of data during the Shuffle phase and removes the need for a reduce operation.


  • As the process shows, Task A runs first. It is a Local Task (executed locally on the client) that scans the small table b, converts it into a HashTable data structure, and writes it to a local file, which is then loaded into the DistributedCache. The HashTable can be abstracted as:
    |key| value| 
    | 1 | 26 | 
    | 2 | 34 |
  • Next is Task B, an MR job without a Reduce phase. It starts MapTasks that scan the large table a. In the Map phase, each record of a is joined against the HashTable for table b held in the DistributedCache, and the result is output directly.
  • Because MapJoin has no Reduce, the result files are written directly by the Map tasks, so there are as many result files as there are Map Tasks.
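
The two tasks can be sketched in Python as follows. The hash table uses the key/value pairs shown above (1 → 26, 2 → 34); the rows of table a and the function names build_hash_table and map_join are hypothetical, and an ordinary dict stands in for the HashTable file distributed to each Map task.

```python
small_b = [(1, 26), (2, 34)]                     # small table b: (id, age)
large_a = [(1, "sales"), (2, "hr"), (3, "ops")]  # hypothetical large table a: (id, dept)

def build_hash_table(rows):
    # Task A: scan the small table and index its rows by join key.
    table = {}
    for key, age in rows:
        table.setdefault(key, []).append(age)
    return table

def map_join(a_rows, hash_table):
    # Task B: stream the large table and probe the in-memory hash table.
    # No shuffle or reduce step is involved.
    for row_id, dept in a_rows:
        for age in hash_table.get(row_id, []):
            yield (row_id, dept, age)

hash_b = build_hash_table(small_b)
joined = list(map_join(large_a, hash_b))
print(joined)  # [(1, 'sales', 26), (2, 'hr', 34)]
```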

Principle of Hive GROUP BY

https://blog.csdn.net/u013668852/article/details/79866931

GROUP BY on multiple fields: the combination of the GROUP BY fields is used as the map output key.

select rank, isonline, count(*) from city group by rank, isonline;

The GROUP BY fields are combined into the map output key, MapReduce's sort orders the pairs by key, and the reduce phase keeps a LastKey to tell where one key ends and the next begins. (This describes only the non-hash aggregation process on the Reduce side.)
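
A minimal Python sketch of this sort-based aggregation for the query above. The input rows are invented, and the names map_phase, reduce_sorted, and last_key are hypothetical; the point is only to show how comparing each key with the previous one is enough to count groups once the input is sorted.

```python
# Hypothetical rows of the city table: (rank, isonline)
rows = [("A", 1), ("B", 0), ("A", 1), ("A", 0), ("B", 0)]

def map_phase(rows):
    # The combination of the GROUP BY fields becomes the map output key.
    return [((rank, isonline), 1) for rank, isonline in rows]

def reduce_sorted(sorted_pairs):
    # Input is sorted by key, so equal keys are adjacent. Keep the last key
    # seen and emit a result whenever the key changes.
    results = []
    last_key = None
    count = 0
    for key, value in sorted_pairs:
        if last_key is not None and key != last_key:
            results.append((*last_key, count))
            count = 0
        last_key = key
        count += value
    if last_key is not None:
        results.append((*last_key, count))
    return results

sorted_pairs = sorted(map_phase(rows), key=lambda kv: kv[0])
grouped = reduce_sorted(sorted_pairs)
print(grouped)  # [('A', 0, 1), ('A', 1, 2), ('B', 0, 2)]
```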


Origin blog.csdn.net/qq_24271537/article/details/113359842