HIVE optimization study

1. Overview

  Continue " those years using Hive stepped pit the remaining part of the article," this blog go into summary Hive common optimization methods in their work and the use of Hive appear at work. The following describes the start optimizing this article.

2. Introduction

  First, let's look at the characteristics of the Hadoop computing framework. Which of these characteristics give rise to performance problems?

  • A large amount of data is not a problem; data skew is.
  • A large number of jobs lowers efficiency: even for a table with only a few hundred rows, repeatedly joining and aggregating it can generate more than a dozen jobs, which takes a long time because MapReduce job initialization is relatively slow.
  • UDAFs such as SUM, COUNT, MAX, and MIN are not troubled by data skew, because Hadoop combines partial results on the map side, so skewed data is not a problem for them.
  • COUNT(DISTINCT) is inefficient when the data volume is large, and multiple COUNT(DISTINCT) expressions are even worse, because COUNT(DISTINCT) groups by the GROUP BY columns and sorts by the DISTINCT column, which is usually a skewed distribution. For example, computing male UV and female UV over Taobao's 3 billion PV per day: grouping by gender and allocating 2 reducers means each reducer must process 1.5 billion records.

  Faced with these problems, what efficient means do we have to optimize? Below are some effective and feasible optimization techniques used in practice:

  • Good model design gets twice the result with half the effort.
  • Solve the data skew problem.
  • Reduce the number of jobs.
  • Set a reasonable number of map and reduce tasks; this can effectively improve performance. (For example, using 160 reducers for a computation on the order of 100,000 rows is quite wasteful; one is enough.)
  • Understand the data distribution and solve the data skew problem yourself when appropriate. set hive.groupby.skewindata = true is a generic optimization, but a generic algorithm cannot always adapt to a specific business context; developers who understand the business and the data can often solve skew problems more accurately and efficiently through the business logic itself.
  • When the data volume is large, use COUNT(DISTINCT) with caution; it is prone to data skew.
  • Merging small files is an effective way to improve scheduling efficiency; if all jobs produce a reasonable number of files, it has a positive effect on the overall scheduling of the cluster.
  • Look at the whole picture: making every single job optimal is not as good as making the overall workflow optimal.

  Next, a question should be forming in our minds: what exactly is the root cause of the poor performance?

3. The root causes of poor performance

  When doing Hive performance optimization, read HiveQL as a MapReduce program, that is, think about how to optimize the computation from the lower-level MapReduce perspective rather than only replacing code at the logical level.

  RAC (Real Application Cluster) is like a small, flexible minivan that responds quickly; Hadoop is like a huge cargo ship with enormous throughput but a high start-up cost. If each task has only a small amount of input and output, utilization will be very low. So the primary task in using Hadoop well is to increase the amount of data each task carries.

  Hadoop's core capabilities are partition and sort, and these are therefore the foundation of optimization.

  Observing how Hadoop processes data, several notable characteristics stand out:

  • Sheer data volume is not the main load concern; excessive run-time pressure comes from skewed data.
  • Running many jobs is relatively inefficient: even a table with a few hundred rows, if repeatedly joined and aggregated, can produce dozens of jobs that take more than 30 minutes, with most of the time spent on job allocation, initialization, and data output. MapReduce job initialization is a relatively time- and resource-consuming step.
  • UDAFs such as SUM, COUNT, MAX, and MIN are not troubled by data skew, because Hadoop combines partial results on the map side, so skewed data is not a problem for them.
  • COUNT(DISTINCT) is inefficient when the data volume is large, and multiple COUNT(DISTINCT) expressions are even worse, because COUNT(DISTINCT) groups by the GROUP BY columns and sorts by the DISTINCT column, which is usually a skewed distribution. Example: male UV and female UV over Taobao's 3 billion PV per day; grouping by gender and allocating 2 reducers means each reducer processes 1.5 billion records.
  • Data skew is the main cause of drastically reduced efficiency; adding an extra MapReduce pass can often avoid the skew.

  The final conclusion: sidestep the hard spot and attack the soft one; use methods such as increasing the number of jobs, increasing the input volume, using more storage space, and making full use of idle CPU to break up the burden caused by data skew.

4. Optimization from the configuration angle

  Knowing the root causes of poor performance, we can likewise optimize from the angle of Hive configuration. Hive already provides preset optimization methods for different kinds of queries, which users can control through configuration options. The following introduces some of these optimization strategies and the options that control them.

4.1 Column pruning

  When reading data, Hive can read only the columns the query actually uses and ignore the others. For example, take the following query:

SELECT a,b FROM q WHERE e<10;

  In this query, table q has 5 columns (a, b, c, d, e), but Hive reads only the 3 columns the query logic actually requires, a, b, and e, and ignores columns c and d; this saves read overhead, intermediate table storage overhead, and data integration overhead.

  The option controlling column pruning is: hive.optimize.cp = true (default is true)
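  As a minimal sketch (using the example table q above), the pruner can be toggled per session with SET:

SET hive.optimize.cp = true;       -- column pruning, on by default
SELECT a, b FROM q WHERE e < 10;   -- only columns a, b and e of q are scanned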

4.2 Partition pruning

  Unnecessary partitions can be pruned during a query. For example, consider the following queries:

SELECT * FROM (SELECT a1,COUNT(1) FROM T GROUP BY a1) subq WHERE subq.prtn=100; # (redundant partition predicate) 
SELECT * FROM T1 JOIN (SELECT * FROM T2) subq ON (T1.a1=subq.a2) WHERE subq.prtn=100;

  Placing the condition "subq.prtn = 100" inside the subquery is more efficient, because it reduces the number of partitions that have to be read. Hive performs this pruning optimization automatically.

  The option controlling partition pruning is: hive.optimize.pruner = true (default is true)
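  For illustration, here is a hedged sketch of what the manually pushed-down form of the two queries above would look like, assuming prtn is the partition column of T and T2:

SELECT * FROM (SELECT a1,COUNT(1) FROM T WHERE prtn=100 GROUP BY a1) subq;        -- predicate pushed into the subquery
SELECT * FROM T1 JOIN (SELECT * FROM T2 WHERE prtn=100) subq ON (T1.a1=subq.a2);  -- only partition 100 of T2 is scanned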

4.3 JOIN operations

  When writing statements with join operations, the smaller table or subquery should be placed on the left side of the JOIN operator. In the Reduce phase, the contents of the table on the left side of the JOIN operator are loaded into memory, so putting the table with fewer entries on the left effectively reduces the chance of OOM (out of memory) errors. In other words, for the same key, put the side with few values first and the side with many values after it; this is the "small table first" principle. If a statement contains multiple JOINs, the handling differs depending on whether the JOIN conditions are the same or not.
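  As a side note (not from the original article), instead of reordering the tables, Hive also provides a STREAMTABLE hint to mark which table should be streamed rather than buffered; a minimal sketch with hypothetical table names big_table and small_table:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val
FROM big_table a
JOIN small_table b ON (a.key = b.key);   -- small_table is buffered in memory, big_table is streamed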

4.3.1 JOIN principles

  There is one principle when writing queries with JOIN operations: the smaller table or subquery should be placed on the left side of the JOIN operator. The reason is that in the Reduce phase of the JOIN, the contents of the table on the left side of the JOIN operator are loaded into memory, so placing the table with fewer entries on the left effectively reduces the chance of OOM errors. When a statement contains multiple JOINs and the JOIN conditions are the same, consider this query:

INSERT OVERWRITE TABLE pv_users 
 SELECT pv.pageid, u.age FROM page_view pv 
 JOIN user u ON (pv.userid = u.userid) 
 JOIN newuser x ON (u.userid = x.userid);
  • If the JOIN keys are the same, no matter how many tables are involved, they are merged into a single MapReduce job
  • One MapReduce task, not 'n' of them
  • The same holds for OUTER JOINs

  If the JOIN conditions are not the same, for example:

INSERT OVERWRITE TABLE pv_users 
   SELECT pv.pageid, u.age FROM page_view pv 
   JOIN user u ON (pv.userid = u.userid) 
   JOIN newuser x ON (u.age = x.age);

  then the number of MapReduce tasks matches the number of JOIN operations, and the query above is equivalent to the following:

INSERT OVERWRITE TABLE tmptable 
   SELECT * FROM page_view pv JOIN user u 
   ON (pv.userid = u.userid);
INSERT OVERWRITE TABLE pv_users 
   SELECT x.pageid, x.age FROM tmptable x 
   JOIN newuser y ON (x.age = y.age);

4.4 MAP JOIN operation

  A MAP JOIN completes the join in the Map phase, with no Reduce phase needed, provided that the required data can be accessed during the Map phase. For example, the query:

INSERT OVERWRITE TABLE pv_users 
   SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age 
   FROM page_view pv 
     JOIN user u ON (pv.userid = u.userid);    

  completes the join in the Map phase (the original post illustrated this with a diagram).

  Relevant parameters are:

  • hive.join.emit.interval = 1000 
  • hive.mapjoin.size.key = 10000
  • hive.mapjoin.cache.numrows = 10000
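  As a supplementary sketch (not from the original post), newer Hive versions can also convert a common join into a map join automatically when the small table fits in memory; verify the option values for your Hive version:

SET hive.auto.convert.join = true;               -- let Hive choose map joins automatically
SET hive.mapjoin.smalltable.filesize = 25000000; -- size threshold (bytes) below which a table counts as "small"

SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid);           -- if user is small enough, it is broadcast to the mappers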

4.5 GROUP BY operation

  Note the following points when performing GROUP BY operations:

  • Partial aggregation on the map side

  In fact, not all aggregation operations need to be done on the reduce side; many aggregations can first be partially aggregated on the map side, with the reduce side producing the final result.

  Here we need to modify the parameters as follows:

  hive.map.aggr = true (whether to aggregate on the map side; default is true)
  hive.groupby.mapaggr.checkinterval = 100000 (the number of entries over which the map side performs the aggregation)

  • Load balancing when there is data skew

  Here we need to set hive.groupby.skewindata. When this option is set to true, the resulting query plan contains two MapReduce jobs. In the first job, the map output is distributed randomly among the reducers; each reducer performs a partial aggregation and emits its result. The effect is that records with the same GROUP BY key may end up on different reducers, which achieves load balancing. The second MapReduce job then distributes the pre-aggregated results to the reducers according to the GROUP BY key (this guarantees that identical GROUP BY keys land on the same reducer) and completes the final aggregation.
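  Putting the two points together, a hedged sketch of a skew-tolerant GROUP BY session (the query itself is illustrative, reusing the page_view table from the earlier examples):

SET hive.map.aggr = true;                        -- partial aggregation on the map side
SET hive.groupby.mapaggr.checkinterval = 100000; -- rows handled per map-side aggregation
SET hive.groupby.skewindata = true;              -- two-stage plan to balance skewed keys

SELECT userid, COUNT(1) AS pv
FROM page_view
GROUP BY userid;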

4.6 Merging small files

  We know that a large number of small files easily causes a bottleneck on the storage side, puts pressure on HDFS, and hurts processing efficiency. This effect can be eliminated by merging the output files of Map and Reduce.

  The parameters controlling merging are:

  • Whether to merge map output files: hive.merge.mapfiles = true (default is true)
  • Whether to merge reduce output files: hive.merge.mapredfiles = false (default is false)
  • Size of the merged files: hive.merge.size.per.task = 256*1000*1000 (default is 256000000)
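  A hedged sketch of enabling both merges for a session (the values are illustrative):

SET hive.merge.mapfiles = true;            -- merge small files produced by map-only jobs
SET hive.merge.mapredfiles = true;         -- also merge small files produced at the reduce end
SET hive.merge.size.per.task = 256000000;  -- target size of the merged files, in bytes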

5. Optimization from the query angle

5.1 Using SQL skillfully to improve queries

  Use SQL skillfully to write efficient queries.

  Scenario: there is a user table recording one row per seller per day, with user_id and ds (date) as the key; the attributes include the main category (main_cat), and the metrics include transaction amount and number of transactions. The goal is to get each seller's total transaction amount and total number of transactions over the last 10 days, together with the seller's main category on the most recent day.

  Solution 1

  The approach is as follows:


INSERT OVERWRITE TABLE t1
SELECT user_id, substr(MAX(CONCAT(ds,cat)),9) AS main_cat FROM users
WHERE ds = 20120329   -- 20120329 is the value of the date column; in real code it can be derived from the current date
GROUP BY user_id;

INSERT OVERWRITE TABLE t2
SELECT user_id, SUM(qty) AS qty, SUM(amt) AS amt FROM users
WHERE ds BETWEEN 20120301 AND 20120329
GROUP BY user_id;

SELECT t1.user_id, t1.main_cat, t2.qty, t2.amt FROM t1
JOIN t2 ON t1.user_id = t2.user_id;


  The idea behind Solution 1, step by step:

  Step 1: take each user_id's main category on the most recent day (via MAX(CONCAT(ds,cat))), and store it in temporary table t1.

  Step 2: summarize the 10-day total transaction amount and number of transactions, and store them in temporary table t2.

  Step 3: join t1 and t2 to get the final result.

  Solution 2

  The optimized code is as follows:

SELECT user_id,substr(MAX(CONCAT(ds,cat)),9) AS main_cat,SUM(qty),SUM(amt) FROM users 
WHERE ds BETWEEN 20120301 AND 20120329 
GROUP BY user_id

  Our conclusion from this work: the cost of Solution 2 is roughly the cost of Step 2 of Solution 1 alone. Performance improved from the original 25 minutes down to under 10 minutes. Avoiding the reads and writes of two temporary tables is the key reason; this approach also applies to looking up data in Oracle.

      SQL is universal; many common SQL optimization schemes achieve the same desired effect in Hadoop distributed computing.

5.2 Data skew caused by joining on invalid IDs

  Problem: logs often lose information. For example, the full-network logs contain roughly 20 million records per day with user_id as the primary key; user_id can be lost during log collection, leaving null values. If you join these logs with bmw_users on user_id, you run into data skew, because Hive treats all records whose key is null as the same key and assigns them to the same task.

      Solution 1: rows with a null user_id do not participate in the join; they are selected separately

SELECT * FROM log a 
JOIN bmw_users b ON a.user_id IS NOT NULL AND a.user_id=b.user_id 
UNION ALL SELECT * FROM log a WHERE a.user_id IS NULL;

  Solution 2: replace the null keys with a random value inside a function

SELECT * FROM log a LEFT OUTER 
JOIN bmw_users b ON 
CASE WHEN a.user_id IS NULL THEN CONCAT('dp_hive',RAND()) ELSE a.user_id END = b.user_id;

  Tuning results: the original query, suffering from data skew, ran for more than 1 hour per day; Solution 1 runs in about 25 minutes per day on average, and Solution 2 in about 20 minutes per day. The optimization effect is clear.

  Our conclusion from this work: Solution 2 works better than Solution 1; it has less IO and fewer jobs. Solution 1 reads the log twice and uses 2 jobs; Solution 2 uses 1 job. This optimization is suited to skew caused by invalid ids (such as -99, '', null, and so on). Turning the null keys into a string plus a random number spreads the skewed data across different reducers, which solves the data skew problem. Because the null values cannot match anything in the join, scattering them onto different reducers does not affect the final result. For reference, the general way a join is implemented in Hadoop: the join is implemented via a secondary sort, where the join column is the partition key, the join column plus the table tag form the sorted group key, records are assigned to reducers by the partition key, and within each reducer they are sorted by the group key.

5.3 Data skew caused by joining columns of different data types

  Problem: data skew occurs when joining on ids of different data types.

  Table s8 is a log with one record per item, and it needs to be joined with the items table, but the join hit skew. The s8 log contains both 32-character string item ids and numeric item ids; the log column is of type string, while the numeric id in the items table is bigint. The guess was that the problem comes from converting s8's item id to a numeric value when hashing to assign reducers, so all string-id records of the s8 log land on the same reducer; the solution confirmed this guess.

  Solution: convert the numeric type to string

 

SELECT * FROM s8_log a LEFT OUTER 
JOIN r_auction_auctions b ON a.auction_id=CAST(b.auction_id AS STRING);

  Tuning results: after the code was adjusted, processing the table, which used to take 1 hour 30 minutes, completes within 20 minutes.

5.4 Taking advantage of Hive's UNION ALL optimization

  A UNION ALL over multiple tables is optimized into a single job.

  Problem: for example, a promotion-effect table must be joined with the items table. The auction_id column in the effect table contains both 32-character string item ids and numeric ids, and it must be joined with the items table to get the item information.

  Solution: the following Hive SQL performs better

SELECT * FROM effect a 
JOIN 
(SELECT auction_id AS auction_id FROM auctions 
UNION ALL 
SELECT auction_string_id AS auction_id FROM auctions) b 
ON a.auction_id=b.auction_id;

  This performs better than filtering the numeric ids and the string ids separately and then joining each with the items table.

  The benefit of writing it this way: one MapReduce job, the items table is read only once, and the effect table is read only once. If this SQL were translated into MapReduce code, then in the map phase each record of table a would be tagged a, each record read from the items table would be tagged b, and they would become two <key, value> pairs: <(b, numeric id), value> and <(b, string id), value>.

  So the items table on HDFS is read only once.

5.5 Working around the limitations of Hive's UNION ALL optimization

  A limitation of Hive's UNION ALL optimization: the optimization only applies to non-nested queries.

  • Eliminating GROUP BY in subqueries

     Example 1: a subquery containing GROUP BY

SELECT * FROM 
(SELECT * FROM t1 GROUP BY c1,c2,c3 UNION ALL SELECT * FROM t2 GROUP BY c1,c2,c3) t3 
GROUP BY c1,c2,c3;

  From the business logic, the GROUP BY in the subquery looks redundant (functionally redundant, unless there is a COUNT(DISTINCT)), and would only be kept out of concern for a Hive bug or for performance (there was once a Hive bug where, without the GROUP BY in the subquery, the data did not come out correct). So, based on experience, this is converted to the following:

SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2)t3 GROUP BY c1,c2,c3 

  Tuning results: after testing, the UNION ALL Hive bug did not appear and the data was consistent. The number of MapReduce jobs dropped from 3 to 1.

     t1 corresponds to one directory and t2 to another; for the MapReduce program, t1 and t2 can serve as multiple inputs of a single MapReduce job, so the problem can be solved with one MapReduce job. The Hadoop computing framework is not afraid of lots of data; it is afraid of lots of jobs.

  However, on other computing platforms such as Oracle this is not necessarily true: splitting a large input into two inputs, sorting and summarizing each separately, and then merging (provided the two sub-sorts run in parallel) may well perform better (just as Shell sort outperforms bubble sort).

  • Eliminating COUNT(DISTINCT), MAX, and MIN in subqueries
SELECT * FROM 
(SELECT * FROM t1 
UNION ALL SELECT c1,c2,c3,COUNT(DISTINCT c4) FROM t2 GROUP BY c1,c2,c3) t3 
GROUP BY c1,c2,c3; 

  Because the subquery contains a COUNT(DISTINCT), simply removing the GROUP BY would not meet the business goal. In that case, using a temporary table to eliminate the COUNT(DISTINCT) not only solves the skew problem but also effectively reduces the number of jobs.

INSERT OVERWRITE TABLE t4 SELECT c1,c2,c3,c4 FROM t2 GROUP BY c1,c2,c3,c4; 
SELECT c1,c2,c3,SUM(income),SUM(uv) FROM 
(SELECT c1,c2,c3,income,0 AS uv FROM t1 
UNION ALL 
SELECT c1,c2,c3,0 AS income,1 AS uv FROM t4) t3 
GROUP BY c1,c2,c3;

  The number of jobs drops to 2, cut in half, and the two MapReduce passes are more efficient than a COUNT(DISTINCT).

     Tuning results: the category and member tables were joined with a billion-row items table; the task that originally took 1963 s finished in 1152 s after the adjustment.

  • Eliminating JOIN in subqueries
SELECT * FROM 
(SELECT * FROM t1 UNION ALL SELECT * FROM t4 UNION ALL SELECT * FROM t2 JOIN t3 ON t2.id=t3.id) x 
GROUP BY c1,c2; 

  The code above runs as 5 jobs. If the JOIN is first materialized into a temporary table t5, and the UNION ALL is done afterwards, it becomes 2 jobs.

INSERT OVERWRITE TABLE t5 
SELECT * FROM t2 JOIN t3 ON t2.id=t3.id; 
SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t4 UNION ALL SELECT * FROM t5) x; 

  Tuning results: for a million-level advertising table, the original 5 jobs took 15 minutes in total; after splitting into 2 jobs, one takes 8-10 minutes and the other 3 minutes.

5.6 Replacing COUNT(DISTINCT) with GROUP BY

  When computing UV, COUNT(DISTINCT) is often used, but when the data is fairly skewed, COUNT(DISTINCT) becomes slow. In that case, try rewriting the UV calculation with GROUP BY.

  • Legacy code
INSERT OVERWRITE TABLE s_dw_tanx_adzone_uv PARTITION (ds=20120329) 
SELECT 20120329 AS thedate,adzoneid,COUNT(DISTINCT acookie) AS uv FROM s_ods_log_tanx_pv t WHERE t.ds=20120329 GROUP BY adzoneid
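  A hedged sketch (not shown in the original post) of the GROUP BY rewrite the section describes: deduplicate the (adzoneid, acookie) pairs first, then count rows per adzoneid, which is equivalent to the COUNT(DISTINCT) above.

INSERT OVERWRITE TABLE s_dw_tanx_adzone_uv PARTITION (ds=20120329) 
SELECT 20120329 AS thedate, adzoneid, COUNT(1) AS uv 
FROM (SELECT DISTINCT adzoneid, acookie FROM s_ods_log_tanx_pv t WHERE t.ds=20120329) tmp 
GROUP BY adzoneid;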

  Whether COUNT(DISTINCT) causes data skew cannot be generalized; it depends on the situation. Here is a data set I tested:

  Test data: 169,857 records

# Daily IP statistics 
CREATE TABLE ip_2014_12_29 AS SELECT COUNT(DISTINCT ip) AS ip FROM logdfs WHERE logdate='2014_12_29'; 
Elapsed: 24.805 seconds 
# Daily IP statistics (rewritten) 
CREATE TABLE ip_2014_12_29 AS SELECT COUNT(1) AS ip FROM (SELECT DISTINCT ip FROM logdfs WHERE logdate='2014_12_29') tmp; 
Elapsed: 46.833 seconds

  Test conclusion: the rewritten statement clearly takes longer than the original one, because the rewritten statement has two SELECTs and one extra job; with this small amount of data there is no skew problem, so the rewrite does not pay off.

6. Optimization summary

  When optimizing, read Hive SQL as a MapReduce program and you will get unexpected surprises. Understanding Hadoop's core capabilities is the foundation of Hive optimization. This is the valuable lesson learned by all members of the project team this year.

  • From long-term observation of how Hadoop processes data, several notable characteristics emerge:
  1. It is not afraid of lots of data; it is afraid of data skew.
  2. Running many jobs is relatively inefficient; for example, even a table with a few hundred rows, if joined and aggregated repeatedly, generates more than a dozen jobs and cannot finish in under half an hour, because MapReduce job initialization is relatively slow.
  3. SUM and COUNT do not have a data skew problem.
  4. COUNT(DISTINCT) has low efficiency; once the data volume grows, problems are almost guaranteed, and multiple COUNT(DISTINCT) expressions are even less efficient.
  • Optimization can proceed along several directions:
  1. Good model design gets twice the result with half the effort.
  2. Solve the data skew problem.
  3. Reduce the number of jobs.
  4. Set a reasonable number of map and reduce tasks; this can effectively improve performance. (For example, using 160 reducers for a computation on the order of 100,000 rows is quite wasteful; one is enough.)
  5. Solving the data skew problem with your own SQL is a good choice. set hive.groupby.skewindata = true is a generic algorithmic optimization, but algorithmic optimization ignores the business and habitually provides a one-size-fits-all solution. ETL developers understand the business and the data better, so solutions based on business logic tend to be more accurate and more effective against skew.
  6. Do not treat COUNT(DISTINCT) carelessly; especially with large data it easily causes skew, so do not leave things to chance. Roll up your sleeves and handle it yourself.
  7. Merging small files is an effective way to improve scheduling efficiency; if our jobs produce a reasonable number of files, it has a positive effect on the overall scheduling of the cluster.

  Look at the whole picture: making every single job optimal is not as good as making the overall workflow optimal.

7. Common optimization parameters

  The number of reducers is mainly determined by three properties:

  • hive.exec.reducers.bytes.per.reducer # controls how much data each reducer of a job handles, based on the total size of the input files. Default: 1 GB.
  • hive.exec.reducers.max # controls the maximum number of reducers; if input size / bytes-per-reducer exceeds this value, the number given by this parameter is used as the cap. This setting does not affect the mapred.reduce.tasks parameter. Default: 999.
  • mapred.reduce.tasks # if this parameter is specified, Hive does not use its estimation logic to compute the number of reducers automatically but starts exactly this many. Default: -1 (automatic).
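  A hedged sketch of adjusting these for one session (the values are illustrative, not recommendations):

SET hive.exec.reducers.bytes.per.reducer = 1000000000;  -- about 1 GB of input per reducer
SET hive.exec.reducers.max = 999;                       -- upper bound on the number of reducers
-- Or fix the count explicitly and bypass Hive's estimate:
SET mapred.reduce.tasks = 15;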

7.1 Impact of these settings

  1. If there are too few reducers: when the data volume is large, each reducer becomes abnormally slow, the task may never complete, and OOM is possible. 2. If there are too many reducers: too many small files are generated, the cost of merging them is too high, and NameNode memory usage rises. If mapred.reduce.tasks is not specified, Hive automatically estimates how many reducers are needed.

8. Conclusion

  That is all for this post. I will share more good optimization tools in a later post. Thank you for taking time out of your busy schedule to read this; if you run into questions while optimizing, you can raise them in the group for discussion or send me an email, and I will do my best to answer. Let's encourage each other!


Origin blog.csdn.net/oZuoLuo123/article/details/87182149