Precautions when hive.groupby.skewindata = true

Like SQL, HiveQL supports the DISTINCT operation, as in the following examples:

(1) SELECT count(DISTINCT uid) FROM log

(2) SELECT ip, count(DISTINCT uid) FROM log GROUP BY ip

(3) SELECT ip, count(DISTINCT uid, uname) FROM log GROUP BY ip

(4) SELECT ip, count(DISTINCT uid), count(DISTINCT uname) FROM log GROUP BY ip

When using the DISTINCT keyword for deduplication in HiveQL, note the following:

Deduplication over multiple columns in Hive depends on the hive.groupby.skewindata setting.
When hive.groupby.skewindata = true, Hive does not support DISTINCT on several different columns in the same query, and reports:
Error in semantic analysis: DISTINCT on different columns not supported with skew in data.
Note: Example (3) above does not count as DISTINCT operations on multiple columns in this sense.
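One way around the error, assuming skew load balancing can be given up for this query, is to turn the option off for the session before running a query such as example (4). A minimal sketch, using the log table from the examples above:

```sql
-- Disable the skew optimization for this session only.
SET hive.groupby.skewindata=false;

-- Example (4): DISTINCT on two different columns in one query.
SELECT ip,
       count(DISTINCT uid)   AS uid_cnt,    -- alias is illustrative
       count(DISTINCT uname) AS uname_cnt   -- alias is illustrative
FROM log
GROUP BY ip;
```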

Group By statement
• Map-side partial aggregation:
• Not every aggregation has to be completed on the Reduce side. Many aggregations can first be performed partially on the Map side, with the final result produced on the Reduce side.
• Hash-based
• Parameters include:
• hive.map.aggr = true: whether to aggregate on the Map side; the default is true
• hive.groupby.mapaggr.checkinterval = 100000: the number of entries on which Map-side aggregation is performed
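As a configuration sketch, the two parameters above can be set per session before a Group By query. The values are the ones listed above; the query itself is only an illustration against the log table used earlier:

```sql
-- Enable Map-side partial aggregation (listed above as the default).
SET hive.map.aggr=true;
-- Value listed above for the Map-side aggregation interval.
SET hive.groupby.mapaggr.checkinterval=100000;

-- Illustrative query: per-ip row counts can be partially
-- aggregated on the Map side before the Reduce side finishes them.
SELECT ip, count(*) AS cnt
FROM log
GROUP BY ip;
```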

• Load balancing when the data is skewed
• hive.groupby.skewindata = false (the default)
• When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the Map output is distributed randomly across the Reducers; each Reducer performs a partial aggregation and outputs the result. Because of this, rows with the same Group By key may be sent to different Reducers, which balances the load. The second MR job then distributes the pre-aggregated results to the Reducers by Group By key (this guarantees that rows with the same Group By key reach the same Reducer) and completes the final aggregation.
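The two-job plan described above can be imitated by hand to see why it helps. The sketch below is only an analogy, not what Hive generates internally: it assumes a hypothetical salt column with 16 buckets to stand in for the random distribution of the first job, and the names salted, stage1, salt and partial_cnt are made up for illustration.

```sql
-- Stage 2: merge the partial counts, grouped by the real key only,
-- so all partial results for one ip meet in the same Reducer.
SELECT ip, sum(partial_cnt) AS cnt
FROM (
  -- Stage 1: group by (ip, salt) so that rows of a hot ip
  -- are spread over several Reducers and pre-aggregated there.
  SELECT ip, salt, count(*) AS partial_cnt
  FROM (
    -- Assumption: 16 salt buckets; any small number would do.
    SELECT ip, CAST(floor(rand() * 16) AS INT) AS salt
    FROM log
  ) salted
  GROUP BY ip, salt
) stage1
GROUP BY ip;
```

The manual form only works directly for aggregates whose partial results can be merged, such as count and sum.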

The hive.groupby.skewindata variable
As the Group By discussion above shows, this variable controls load balancing. When the data is skewed and the variable is set to true, Hive balances the load automatically.

HIVE-2416
Currently, when multiple DISTINCT functions are used, the hive.groupby.skewindata optimization parameter must be set to false, or else an exception is raised:
Error in semantic analysis: DISTINCT on different columns not supported with skew in data
Skew Group By should support multiple DISTINCT functions.
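If the skew optimization has to stay on, one possible workaround, which is not from the issue itself, is to compute each DISTINCT count in its own subquery, so that every Group By contains a single DISTINCT column, and then join the results. The sketch assumes ip is never NULL (an inner join would drop such groups); the aliases a, b, uid_cnt and uname_cnt are illustrative, and behaviour should be checked on the Hive version in use.

```sql
SET hive.groupby.skewindata=true;

-- Each subquery uses DISTINCT on one column only.
SELECT a.ip, a.uid_cnt, b.uname_cnt
FROM (
  SELECT ip, count(DISTINCT uid) AS uid_cnt
  FROM log
  GROUP BY ip
) a
JOIN (
  SELECT ip, count(DISTINCT uname) AS uname_cnt
  FROM log
  GROUP BY ip
) b
ON a.ip = b.ip;
```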

Origin www.cnblogs.com/QFKing/p/11869337.html