Table order in a SQL join: does putting the small table first improve efficiency?

References:
https://blog.csdn.net/qq_30349961/article/details/82662550
http://blog.sina.com.cn/s/blog_6ff05a2c01016j7n.html

You often see Hive optimization advice saying that when you join a small table with a large table, you should write the small table first, and that this makes the Hive join run faster. The reason usually given is that the small table can be loaded into memory first, and each record of the large table is then probed against it in memory to complete the join. This sounds plausible, but on closer inspection it does not hold up.
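For reference, here is a minimal Python sketch of the mechanism this advice implies, with made-up table data and a hypothetical map_join name: build a hash table from the small table in memory, then stream the large table through it and probe per record. (Hive does offer this as a map join, but as we will see, it is not what an ordinary reduce-side join does.)

```python
# A toy sketch of the "small table in memory" idea: a hash join.
# The table data and the map_join name are hypothetical, for illustration.
def map_join(small_rows, big_rows, key):
    # Load the small table into an in-memory hash table once.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the big table and probe the hash table for each record.
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}

small = [{"id": 3, "name": "x"}]
big = [{"id": 3, "val": 1}, {"id": 4, "val": 2}, {"id": 3, "val": 3}]
print(list(map_join(small, big, "id")))
# [{'id': 3, 'name': 'x', 'val': 1}, {'id': 3, 'name': 'x', 'val': 3}]
```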

How small does a table have to be to count as "small"? And what if the so-called small table does not fit in memory? I ran an associated query where one table had only a few records, which surely qualifies as small, yet the execution log still showed the reducer writing to disk. In fact, after a reducer receives all the map output it sorts it by key and merges the spilled disk files. There may be many spills, each producing a temporary file, but they are eventually merged so that each reducer ends up with a single file.
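As a toy illustration of that spill-and-merge behavior (my own sketch, not Hive's code): each spill is a sorted run standing in for a temporary file, and the final merge produces the reducer's single sorted file.

```python
import heapq

# Each "spill" below is a sorted run, standing in for a temporary
# file written to disk; heapq.merge plays the role of the final merge
# that leaves each reducer with one sorted file.
spills = [
    sorted([(3, "A-row"), (1, "A-row")]),  # first spill
    sorted([(3, "B-row"), (2, "A-row")]),  # second spill
]
merged = list(heapq.merge(*spills))
print(merged)
# [(1, 'A-row'), (2, 'A-row'), (3, 'A-row'), (3, 'B-row')]
```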

I ran an experiment joining a one-record table with a table of more than 300 million records. Whether the small table came before or after the join keyword, the execution time was almost identical, and the reduce logs of the two queries were nearly the same. If the claim above were true, that the table on the left of the join is loaded into memory and the table on the right is probed against it, then with the 300-million-record table on the left, memory could not possibly hold that many records and they would inevitably spill to disk, so that query should take far longer than the one with the small table first. That is not what happens, which means the explanation given above is wrong.

In fact, the statement "putting the small table first in a join improves efficiency" is more accurately phrased as "putting the table with fewer duplicate join keys first improves join efficiency."

Let's analyze how Hive actually implements a two-table join underneath. No matter how complex a Hive query is, it is eventually compiled into MapReduce jobs, so Hive's join must be implemented the way a MapReduce join is. In a MapReduce join, simply put, the map phase emits a composite key made of the join key plus a flag marking whether the record came from the left or the right table, and a value made of the same flag plus the record itself. In the shuffle stage, records are sorted primarily by the join key and secondarily by the table flag. Partitioning uses only the join key, so all records with the same join key land in the same reducer's value list, and within that list the left table's records are guaranteed to come before the right table's.
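Here is a rough Python sketch of that map-side scheme (the table layout and function names are mine, for illustration): each record is emitted under the composite key (join key, table flag), partitioning looks only at the join key, and ordinary tuple ordering gives the primary sort on the key and the secondary sort on the flag.

```python
# Sketch of the map side of a reduce-side join.
# Flag 0 marks the left table (A), flag 1 the right table (B).
def map_emit(row, table_flag):
    # Composite key: (join key, table flag); value: (table flag, record).
    return ((row["id"], table_flag), (table_flag, row))

def partition(composite_key, num_reducers):
    # Partition on the join key ONLY, so records from both tables
    # with the same id are sent to the same reducer.
    join_key, _flag = composite_key
    return hash(join_key) % num_reducers
```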

For example, with A JOIN B ON (a.id = b.id), suppose both table A and table B have a record with id = 3. The composite key of the record from A is (3, 0) and the composite key of the record from B is (3, 1), which guarantees that A's record sorts before B's. When the reducer runs, both records with id = 3 appear in the same value list: value list = [table A's record with id = 3, table B's record with id = 3].
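Continuing the sketch above with the id = 3 example: sorting the emitted pairs by composite key puts A's (3, 0) ahead of B's (3, 1), so the value list the reducer sees for id = 3 starts with table A's record.

```python
a_rows = [{"id": 3, "src": "A"}]
b_rows = [{"id": 3, "src": "B"}]
emitted = [map_emit(r, 0) for r in a_rows] + [map_emit(r, 1) for r in b_rows]
emitted.sort(key=lambda kv: kv[0])   # primary: join key; secondary: flag
value_list = [value for key, value in emitted if key[0] == 3]
print(value_list)
# [(0, {'id': 3, 'src': 'A'}), (1, {'id': 3, 'src': 'B'})]  -- A before B
```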

Now let's look at what the reducer does when joining the two tables. The reducer processes all records with the same id together; we will write the value list as an array v. (A runnable simulation of the operation counts below appears after this list.)

1) The reducer reads the first record v[0]. If v[0] came from table B, then table A has no record with this id, nothing will be output for it, and there is no need to process this id any further. Reading v[0] costs 1 read operation.

2) If v[0] through v[length-1] all turn out to be records from table A, then table B has no record with this id and again nothing is output; note that by this point the reducer has already performed length read operations on the value list.

3) Suppose table A has 1 record with id = 3 and table B has 10. The reducer reads v[0] and finds a table A record: 1 read operation. It then reads v[1], finds a table B record, and can immediately join v[0] with v[1] and output the result: 2 operations so far. The reducer now knows that everything from v[1] onward belongs to table B, so it joins v[0] with v[2], v[3], ... v[10] in turn and outputs each pair, for a total of 11 operations.

4) Now reverse it: table A has 10 records with id = 3 and table B has 1. The reducer reads v[0] and finds a table A record: 1 read operation. It reads v[1], still a table A record: 2 read operations so far. Continuing this way, when it reads v[9] and finds yet another table A record, it has performed 10 read operations. It then reads the last record v[10], finds a table B record, joins v[0] with v[10] and outputs the pair: 11 operations. Next it joins v[1] through v[9] with v[10] one by one and outputs each pair, for a total of 20 operations.

5) Slightly more complex: table A has 2 records with id = 3 and table B has 5. The reducer reads v[0], a table A record: 1 read operation. It reads v[1], still a table A record: 2 read operations. It reads v[2], finds a table B record, and joins v[0] with v[2] directly: 3 operations. It then joins v[0] with v[3] through v[6] in turn: 7 operations. Finally it joins v[1] with v[2] through v[6] in turn: 12 operations in total.

6) Turn example 5 around: table A has 5 records with id = 3 and table B has 2. The reducer reads v[0], a table A record: 1 read operation. It reads v[1], still table A: 2 read operations. Continuing, when it reads v[4] and finds it is still a table A record, it has performed 5 read operations. It then reads v[5], finds a table B record, and joins v[0] with v[5]: 6 operations. It joins v[0] with v[6]: 7 operations. Then v[1] joins with v[5] and v[6]: 9 operations; v[2] with v[5] and v[6]: 11 operations; and so on, until finally v[4] joins with v[5] and v[6]: 15 operations in total.

7) One more thing worth mentioning: while the reducer scans table A's records it also counts how many records share the current key, and when that count exceeds hive.skewjoin.key (by default 1,000,000), the reducer prints the key in its log and flags the join key as skewed.
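To make the counting above concrete, here is a small simulation (my own Python sketch, not Hive source) of the reducer walk described in steps 1) through 6): it buffers the left table's records, then for each right-table record counts one operation for the read (fused with the first join output, as in step 3) plus one per additional join output. It reproduces the totals of 11, 20, 12, and 15.

```python
def reduce_ops(values):
    """Count operations for one join key.
    values: list of (flag, record); flag 0 = left table, 1 = right table."""
    ops, left = 0, []
    for flag, record in values:
        if flag == 0:
            ops += 1              # read a left-table record and buffer it
            left.append(record)
        else:
            if not left:          # step 1): no left records -> one read,
                return ops + 1    # then skip the rest of this id
            ops += 1              # read the right record; this read is fused
                                  # with joining it to the first buffered record
            ops += len(left) - 1  # join it with the remaining buffered records
    return ops                    # step 2): all left records -> length reads

def vlist(n_left, n_right):
    return [(0, "A")] * n_left + [(1, "B")] * n_right

for a, b in [(1, 10), (10, 1), (2, 5), (5, 2)]:
    print(f"A has {a} record(s), B has {b}: {reduce_ops(vlist(a, b))} operations")
# A has 1 record(s), B has 10: 11 operations   (step 3)
# A has 10 record(s), B has 1: 20 operations   (step 4)
# A has 2 record(s), B has 5: 12 operations    (step 5)
# A has 5 record(s), B has 2: 15 operations    (step 6)
```

Swapping the tables leaves the number of output pairs unchanged but changes the number of buffered-read operations, which is exactly why the table with fewer duplicate keys belongs on the left.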

The final conclusion: for every duplicate join key in the table written on the left side of the join, the underlying implementation performs one extra operation.

Suppose table A has ten million ids and each id appears three times on average. Putting table A first in the join then costs thirty million extra operations, and that is when the difference between writing it first and writing it last shows up in performance.


Origin: www.cnblogs.com/bgh408/p/11646286.html