MapReduce principle diagram:
Process:
Block data: dear car bear car
Map processing: split the data on spaces and output key-value pairs, where the key is each word obtained from the split and the value is 1.
Data output by map: (dear,1),(car,1),(bear,1),(car,1)
Shuffle: send all pairs with the same key to the same reduce
For example, if the whole input produces 4 (Dear, 1) key-value pairs, they are converted to [Dear, Iterable(1, 1, 1, 1)] and passed into reduce() as two parameters: the key and the iterable of values
Inside reduce(), the total count of Dear is calculated as 4, and (Dear, 4) is output as a key-value pair
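The word-count flow above can be sketched in plain Python. This is only a single-process illustration, not Hadoop code; the function names `map_phase`, `shuffle_phase` and `reduce_phase` are made up for this sketch.

```python
from collections import defaultdict


def map_phase(block):
    """Split a block on whitespace and emit (word, 1) for every word."""
    return [(word, 1) for word in block.split()]


def shuffle_phase(pairs):
    """Group values by key so that each key reaches exactly one reduce call."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the 1s collected for one key."""
    return (key, sum(values))


block = "dear car bear car"
mapped = map_phase(block)
grouped = shuffle_phase(mapped)
# Keys are sorted here only to make the printed order deterministic.
counts = [reduce_phase(key, values) for key, values in sorted(grouped.items())]

print(mapped)  # [('dear', 1), ('car', 1), ('bear', 1), ('car', 1)]
print(counts)  # [('bear', 1), ('car', 2), ('dear', 1)]
```

With only this one block as input, dear is counted once; the count of 4 in the text corresponds to a larger input.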
Reposted from https://blog.csdn.net/ych0112xzz/article/details/81186204
Joins in Hive can be divided into Common Join (the join is done in the Reduce phase) and Map Join (the join is done in the Map phase)
hive common join
Hive SQL is translated into MapReduce tasks that run on YARN
Map stage
- Read the data of the source table, and use the column in the Join ON condition as the key of the Map output. If the Join has multiple join keys, the combination of these keys is used as the key;
- The value of the Map output consists of the columns needed after the join (those used in SELECT or WHERE); the value also carries a tag that indicates which table it comes from;
- Sort by key
Shuffle stage
Hash the key and push each key/value pair to a reduce chosen by the hash value, which ensures that identical keys from the two tables end up in the same reduce
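A minimal sketch of this hash-based routing, assuming a hypothetical `partition` helper and made-up join keys. `zlib.crc32` stands in for the key's hash function because it is stable across runs, unlike Python's built-in `hash()` for strings.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_reducers


NUM_REDUCERS = 3          # assumed number of reduce tasks
keys_a = [1, 2, 3, 2]     # hypothetical join keys read from table a
keys_b = [2, 3, 4]        # hypothetical join keys read from table b

placement_a = {key: partition(key, NUM_REDUCERS) for key in keys_a}
placement_b = {key: partition(key, NUM_REDUCERS) for key in keys_b}

# The reducer depends only on the key, never on the source table,
# so every key present in both tables is routed to the same reducer.
shared = set(keys_a) & set(keys_b)
same_reducer = all(placement_a[key] == placement_b[key] for key in shared)
print(same_reducer)  # True
```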
Reduce phase
The join is completed per key; during the join, the tag identifies which table each value came from.
Take the following HQL as an example to illustrate the process:
SELECT a.id, a.dept, b.age FROM a JOIN b ON (a.id = b.id);
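The three stages of the common join can be simulated for the query above in a few lines of Python. The table contents and the helper names (`map_a`, `map_b`, `shuffle`, `reduce_join`) are assumptions made for illustration; this is not how Hive itself is implemented.

```python
from collections import defaultdict

# Hypothetical sample rows for tables a and b.
table_a = [{"id": 1, "dept": "sales"}, {"id": 2, "dept": "hr"}, {"id": 3, "dept": "it"}]
table_b = [{"id": 1, "age": 26}, {"id": 2, "age": 34}]


def map_a(row):
    # Key = join column; value = (table tag, column needed by SELECT).
    return (row["id"], ("a", row["dept"]))


def map_b(row):
    return (row["id"], ("b", row["age"]))


def shuffle(pairs):
    """Group tagged values by join key and order the groups by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_join(key, tagged_values):
    """Use the tag to separate the two tables, then pair their values (inner join)."""
    depts = [value for tag, value in tagged_values if tag == "a"]
    ages = [value for tag, value in tagged_values if tag == "b"]
    return [(key, dept, age) for dept in depts for age in ages]


pairs = [map_a(row) for row in table_a] + [map_b(row) for row in table_b]
result = []
for key, values in shuffle(pairs).items():
    result.extend(reduce_join(key, values))

print(result)  # [(1, 'sales', 26), (2, 'hr', 34)]
```

The row with id 3 has no match in b, so reduce_join returns nothing for it, as expected for an inner join.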
hive map join
Assuming that table a is a large table, b is a small table, and hive.auto.convert.join=true, Hive automatically converts the join to a MapJoin when the query is executed.
Reason: this avoids transferring a large amount of data in the Shuffle phase and eliminates the Reduce operation
- As shown in the process, the first is Task A, a Local Task (a task executed locally on the client). It scans the data of small table b, converts it into a HashTable data structure, writes it to a local file, and then loads the file into the DistributedCache. The data structure of the HashTable can be abstracted as:
| key | value |
|-----|-------|
| 1 | 26 |
| 2 | 34 |
- Next is Task B, an MR job without Reduce. It starts MapTasks to scan large table a. In the Map phase, each record of a is joined against the HashTable of table b held in the DistributedCache, and the result is output directly.
- Since MapJoin has no Reduce, the result files are written directly by the Map tasks. There are as many result files as there are Map Tasks.
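The two tasks can be sketched as follows, assuming made-up rows for table a and two input splits. The hash table holds the key/value pairs shown in the table above; `build_hash_table` and `map_task` are hypothetical names, and an in-memory dict stands in for the file distributed through the cache.

```python
# Small table b as (id, age) rows, matching the hash table shown above.
small_b = [(1, 26), (2, 34)]

# Hypothetical large table a, already divided into two input splits of (id, dept) rows.
big_a_splits = [
    [(1, "sales"), (3, "it")],
    [(2, "hr"), (1, "ops")],
]


def build_hash_table(rows):
    """Task A: load small table b into an in-memory hash table keyed by id."""
    table = {}
    for key, age in rows:
        table.setdefault(key, []).append(age)
    return table


def map_task(split, hash_table):
    """Task B: probe the hash table for each record of a; there is no reduce step."""
    output = []
    for a_id, dept in split:
        for age in hash_table.get(a_id, []):
            output.append((a_id, dept, age))
    return output


hash_table = build_hash_table(small_b)
# One result "file" per map task, because nothing merges the map outputs.
result_files = [map_task(split, hash_table) for split in big_a_splits]

print(hash_table)         # {1: [26], 2: [34]}
print(result_files)       # [[(1, 'sales', 26)], [(2, 'hr', 34), (1, 'ops', 26)]]
print(len(result_files))  # 2
```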
Principle of hive group by
https://blog.csdn.net/u013668852/article/details/79866931
Group by on multiple fields
The combination of the GROUP BY fields is used as the key of the map output
select rank, isonline, count(*) from city group by rank, isonline;
Combine the GROUP BY fields into the key of the map output, rely on MapReduce sorting, and keep the LastKey in the reduce phase to detect when the key changes. The MapReduce process is as follows (this only illustrates the non-hash aggregation process on the Reduce side)
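A minimal sketch of this sort-based aggregation for the query above, assuming a handful of made-up rows from the city table. The `last_key` variable plays the role of LastKey: because the input is sorted, a change of key means the previous group is complete.

```python
# Hypothetical (rank, isonline) rows from the city table.
rows = [("A", 1), ("B", 0), ("A", 1), ("A", 0), ("B", 0)]


def map_phase(rows):
    """Emit ((rank, isonline), 1): the GROUP BY fields together form the key."""
    return [((rank, isonline), 1) for rank, isonline in rows]


def reduce_sorted(sorted_pairs):
    """Count rows per key in one pass over key-sorted input, tracking the last key seen."""
    results = []
    last_key = None
    count = 0
    for key, value in sorted_pairs:
        if key != last_key:
            if last_key is not None:
                # The key changed, so the previous group is finished.
                results.append((*last_key, count))
            last_key = key
            count = 0
        count += value
    if last_key is not None:
        results.append((*last_key, count))  # flush the final group
    return results


# Sorting by key stands in for the sort that MapReduce performs between map and reduce.
sorted_pairs = sorted(map_phase(rows), key=lambda pair: pair[0])
result = reduce_sorted(sorted_pairs)

print(result)  # [('A', 0, 1), ('A', 1, 2), ('B', 0, 2)]
```

Only one counter is held in memory at a time, which is what distinguishes this approach from hash aggregation, where a counter is kept for every key.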