Hive data skew: main causes and solutions

Causes of data skew

Data skew falls into two main categories: skew in aggregation (group by) and skew in joins.

Skew in group by aggregation

  • Causes:
    • Grouping by a low-cardinality dimension where some dimension values carry far too many rows, so the reduce tasks handling those values take a very long time;
    • When statistics are computed by type, one type may have a particularly large amount of data while the other types have very little. A group by on the type shuffles all rows with the same key to the same reduce task for aggregation. When one group's data is too large, the other reducers finish while that one is still computing, and every other node sits waiting for it. This is why we so often see a job stuck at "map 100%, reduce 99%";
  • Solutions:
    • set hive.map.aggr=true;
    • set hive.groupby.skewindata=true;
  • Principle:
    • hive.map.aggr=true enables map-side (partial) aggregation, so each mapper pre-aggregates its own data before the shuffle;
    • hive.groupby.skewindata=true makes the generated query plan contain two MR jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits the result. Because rows with the same Group By key may land on different reducers, the load is balanced. The second MR job then distributes the pre-aggregated results to the reducers by Group By key, which guarantees that identical keys end up on the same reducer, and performs the final aggregation. A minimal sketch follows this list.
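A minimal sketch of applying both settings before a skewed aggregation (the table and column names user_log and user_type are hypothetical):

-- enable map-side partial aggregation
set hive.map.aggr=true;
-- split the skewed group by into two MR jobs for load balancing
set hive.groupby.skewindata=true;
-- hypothetical skewed query: a few user_type values dominate the table
select user_type, count(*) as cnt from user_log group by user_type;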

Skew: optimizing the number of maps and reduces

  • Cause 1: there are too many small files, and the small files need to be merged. This can be solved with set hive.merge.mapredfiles=true;
  • Cause 2: the input badly mixes large and small files, e.g. one 128M file alongside 1000 small files of 1KB each. Solution: merge the input files before the task runs, combining the many small files into one large file, via set hive.merge.mapredfiles=true;
  • Cause 3: a single file is slightly larger than the configured block size, in which case the number of maps needs to be increased appropriately. Solution: raise mapred.map.tasks;
  • Cause 4: the file sizes are moderate, but the map-side computation is very heavy, e.g. select id, count(*), sum(case when ...), sum(case when ...) ..., which also calls for more maps. Solution: raise mapred.map.tasks and mapred.reduce.tasks. A hedged sketch of these knobs follows this list.
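A hedged sketch of the settings listed above; the numeric values are purely illustrative and should be tuned per cluster:

-- merge the small files left at the end of a map-reduce job
set hive.merge.mapredfiles=true;
-- hint a larger number of map tasks for heavy map-side computation
set mapred.map.tasks=20;
-- raise the reducer count alongside it
set mapred.reduce.tasks=10;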

Skew when the HQL contains count(distinct)

  • If the data volume is very large, executing SQL of the form select a, count(distinct b) from t group by a; runs into data skew, because all rows sharing a value of a must be pulled to a single reducer to deduplicate b.
  • Workaround: use sum ... group by instead, e.g.: select a, sum(1) from (select a, b from t group by a, b) t1 group by a; (note that Hive requires an alias, here t1, on the subquery). A before/after sketch follows.
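A before/after sketch of the rewrite, using the placeholder names t, a, and b from above:

-- skew-prone: all rows sharing a value of a land on one reducer,
-- which must deduplicate every value of b for that group
select a, count(distinct b) from t group by a;
-- rewritten: the inner group by deduplicates (a, b) pairs across many
-- reducers first; the outer query then just counts rows per value of a
select a, sum(1) from (select a, b from t group by a, b) t1 group by a;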

Skew in HQL joins and their optimization

  • When a large table must be joined with a small table, use a mapjoin to load the small table into memory, e.g. (here a is the small table): select /*+ MAPJOIN(a) */ a.c1, b.c1, b.c2 from a join b on a.c1 = b.c1; (see the note after this list for automatic map-join conversion).
  • When a join is required but the join field contains nulls, e.g. the id in table log must be joined with the id in table users:
    • Solution 1: rows whose id is null do not take part in the join,
      such as:
select * from log a
join users b on a.id is not null and a.id = b.id
union all
select * from log a where a.id is null;
  • Solution 2: assign null values a random key,
    such as:
select * from log a
left outer join users b
on case when a.user_id is null then concat('hive', rand()) else a.user_id end = b.user_id;
    The random keys spread the null rows evenly across reducers, and because they never match a real user_id, the join result is unchanged.
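As an aside to the MAPJOIN hint above, Hive can also convert such joins to map joins automatically; a hedged sketch (the size threshold is illustrative):

-- let Hive convert a join to a map join when one side is small enough
set hive.auto.convert.join=true;
-- size in bytes below which a table counts as "small"
set hive.mapjoin.smalltable.filesize=25000000;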

Setting a reasonable number of maps

Overview of the main points:

  • 1) Normally, a job produces one or more map tasks from its input directory.
    The main determining factors are: the total number of input files, the size of the input files, and the file block size configured for the cluster.
  • 2) Is a higher number of maps always better?
    No. If a task has many small files (much smaller than the 128M block size), each small file is treated as a block and handled by its own map task. When starting and initializing a map task takes significantly longer than its processing logic, this is a great waste of resources; moreover, the number of map tasks that can execute concurrently is limited.
  • 3) If we make sure each map processes close to 128M of file blocks, can we sit back and relax?
    Not necessarily. For example, a 127M file would normally be handled by a single map, but if the file has only one or two small fields yet tens of millions of records, and the map's processing logic is fairly complex, doing the work with one map task will certainly be slow as well.
  • For problems 2) and 3) above we need two opposite remedies: reducing the number of maps and increasing the number of maps. A hedged sketch of both follows this list.
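A hedged sketch of both remedies, assuming the standard Hive/Hadoop parameter names; the split sizes are illustrative:

-- reduce the number of maps: combine small files into larger input splits
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;
-- increase the number of maps: shrink the maximum split size so that a
-- complex 127M file is cut into several splits, each handled by one map
-- set mapred.max.split.size=32000000;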

Source: www.cnblogs.com/sx66/p/12039563.html