Hive optimization basics (1)

Enable bucketing: set hive.enforce.bucketing=true;
Set the number of reducers: set mapreduce.job.reduces=3;

  1. Table storage and compression: store Hive tables as ORC or Parquet, compressed with ZLIB or Snappy. Parquet is a columnar storage format well suited to analytical workloads (see the sketch below).
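  A minimal sketch of creating a compressed columnar table; the table and column names are made up for illustration:
    create table user_orders_orc (
      user_id bigint,
      amount  double
    )
    stored as orc
    tblproperties ("orc.compress"="SNAPPY");   -- ZLIB is the other common choice
  A Parquet table works the same way with stored as parquet and the parquet.compression table property.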

  2. Fetch task conversion: for simple queries (global select, selecting specific columns, LIMIT, and so on) Hive can skip MapReduce and read the data directly. set hive.fetch.task.conversion=more; (other possible values are minimal and none).
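  For example (a sketch with a hypothetical table t), once the setting above is enabled a query like this is answered by a fetch task without starting a MapReduce job:
    set hive.fetch.task.conversion=more;
    select id, name from t limit 10;   -- simple column selection + limit, no MR job launched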

  3. Local mode: run small jobs on a single machine instead of submitting them to the cluster.
    a. Enable automatic local mode: set hive.exec.mode.local.auto=true;
    b. Input-size threshold for local mode: set hive.exec.mode.local.auto.inputbytes.max=51234560; (the default is 128 MB)
    c. Maximum number of input files for local MR: set hive.exec.mode.local.auto.input.files.max=10; (the default is 4)
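  A combined sketch: with the three settings below, a small ad-hoc query (hypothetical table small_log) runs locally rather than being submitted to the cluster:
    set hive.exec.mode.local.auto=true;
    set hive.exec.mode.local.auto.inputbytes.max=51234560;
    set hive.exec.mode.local.auto.input.files.max=10;
    select dt, count(1) from small_log group by dt;   -- input is below the thresholds, so it runs in local mode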

  4. Large table JOIN small table: map join optimization
    a. set hive.auto.convert.join=true; (the default is true)
    b. Small-table threshold: set hive.mapjoin.smalltable.filesize=25000000; (default 25 MB)
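  A sketch, assuming a large fact table orders and a small dimension table dim_city: with the settings above, Hive broadcasts the small table to every mapper and the join finishes in the map phase, with no shuffle of the large table's keys.
    set hive.auto.convert.join=true;
    select o.order_id, c.city_name
    from orders o
    join dim_city c on o.city_id = c.city_id;   -- dim_city is under 25 MB, so this is converted to a map join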

  5. Large table JOIN large table:
    a. Null key filtering: if rows with a null key are not needed, filter them out before the join.
    Before optimization: SELECT a.* FROM nullidtable a JOIN ori b ON a.id = b.id;
    After optimization: SELECT a.* FROM (SELECT * FROM nullidtable WHERE id IS NOT NULL) a JOIN ori b ON a.id = b.id;
    b. Null key conversion: sometimes a key is null for a large number of rows, yet those rows are not bad data and must appear in the join result. In that case the null keys in table a can be replaced with another value so the rows are spread across reducers. Replacing them with a constant still sends them all to the same reducer (no random distribution):
    SELECT a.*
    FROM nullidtable a
    LEFT JOIN ori b ON CASE WHEN a.id IS NULL THEN 'hive' ELSE a.id END = b.id;
    Random distribution (a random suffix scatters the null keys evenly across reducers):
    SELECT a.*
    FROM nullidtable a
    LEFT JOIN ori b ON CASE WHEN a.id IS NULL THEN concat('hive', rand()) ELSE a.id END = b.id;

  6. SQL optimization
    1. Column pruning: hive.optimize.cp=true (default true). Only the columns required by the query are read.
    2. Partition pruning: hive.optimize.pruner=true (default true).
    3. GROUP BY: when one key carries far more data than the others, the data is skewed.
    (1) Aggregate on the map side first (default true):
    set hive.map.aggr = true;
    (2) Number of rows at which map-side aggregation is checked:
    set hive.groupby.mapaggr.checkinterval = 100000;
    4. COUNT(DISTINCT): harmless on small data, but avoid it on large data. COUNT DISTINCT is completed by a single Reduce task, and if that reducer has to process too much data the whole job struggles to finish. COUNT DISTINCT is therefore usually replaced by GROUP BY followed by COUNT, as in the sketch below.
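  A sketch of the rewrite, using a hypothetical table bigtable:
    -- original: a single reducer handles the whole distinct
    select count(distinct id) from bigtable;
    -- rewritten: the group by spreads the deduplication over many reducers,
    -- then a cheap count runs on the deduplicated result
    select count(1) from (select id from bigtable group by id) t;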

  7. Load balancing for skewed GROUP BY keys
    (3) Load balancing when the data is skewed (default false):
    set hive.groupby.skewindata = true;
    When this is enabled, Hive generates a plan with two MR jobs. In the first job the map output is distributed to the reducers randomly and each reducer performs a partial aggregation, so rows with the same Group By key may land on different reducers; this is what balances the load. The second MR job then distributes the partially aggregated results by the Group By key, which guarantees that identical keys reach the same reducer, where the final aggregation is completed.
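  A combined sketch of the three GROUP BY skew settings ahead of a query whose key distribution is skewed (hypothetical table events):
    set hive.map.aggr = true;
    set hive.groupby.mapaggr.checkinterval = 100000;
    set hive.groupby.skewindata = true;
    select event_type, count(1) from events group by event_type;   -- executed as two MR jobs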

  8. Avoid Cartesian products: avoid joins without an ON condition or with an invalid join condition, because Hive can only use a single reducer to compute a Cartesian product.
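  For instance (hypothetical tables a and b), the first query below degenerates into a single-reducer Cartesian product while the second joins on a key:
    -- bad: no join condition, one reducer has to produce |a| x |b| rows
    select * from a cross join b;
    -- better: an equi-join condition lets the rows be distributed across reducers
    select * from a join b on a.id = b.id;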

  9. Dynamic partition adjustment
    (1) Enable the dynamic partition function (default true): set hive.exec.dynamic.partition=true;
    (2) Dynamic partition mode. The default is strict, which requires at least one partition column to be specified as a static partition; nonstrict allows every partition column to be dynamic. set hive.exec.dynamic.partition.mode=nonstrict;
    (3) Maximum number of dynamic partitions that can be created across all nodes executing the MR job: set hive.exec.max.dynamic.partitions=1000;
    (4) Maximum number of dynamic partitions that can be created on each node executing the MR job; set this according to the actual data: set hive.exec.max.dynamic.partitions.pernode=100;
    (5) Maximum number of HDFS files that can be created by the whole MR job: set hive.exec.max.created.files=100000;
    (6) Whether to throw an exception when an empty partition is generated; usually no setting is needed: set hive.error.on.empty.partition=false;
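  A sketch of a dynamic-partition insert under these settings; the table names and the dt partition column are made up for illustration:
    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert overwrite table orders_part partition (dt)
    select order_id, amount, dt from orders_stage;   -- the dt values in the data decide the target partitions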

  10. Data skew: tuning the number of map and reduce tasks

    1. Is it better to have more maps? Not necessarily. With many small files, each small file is treated as a block and handled by its own map task, and starting and initializing a map task takes far longer than its actual processing, so a lot of resources are wasted.
      Solution: reduce the number of maps by merging small files before execution.
      set mapred.max.split.size=112345600;
      set mapred.min.split.size.per.node=112345600;
      set mapred.min.split.size.per.rack=112345600;
      These three parameters determine the size of the merged splits: data larger than the block size (128 MB) is split at 128 MB; pieces between 100 MB and 128 MB are split at 100 MB; pieces smaller than 100 MB (small files and the remainders left over from splitting large files) are merged together.
      set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
      This parameter makes Hive merge small files before execution.

    Is it enough to guarantee that each map processes close to 128 MB? Not necessarily. Consider a 127 MB file with only two fields but tens of millions of records: a single map task handling it is very slow.
      Solution: increase the number of maps, for example by first redistributing the data into more files.
      set mapred.reduce.tasks=10; each map task then processes a bit more than 12 MB (a few million records), which is far more efficient.
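      The text does not spell out the rewrite step, but a common way to do it (a sketch; the distribute by rand() trick and the table names are assumptions, not from the original) is to push the single file through an intermediate job with 10 reducers, so the next job sees 10 input files and therefore 10 maps:
      set mapred.reduce.tasks=10;
      -- rewrite the 127 MB table into 10 roughly equal files
      insert overwrite table wide_tmp
      select * from wide_table
      distribute by rand();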
      Summary: controlling the number of maps follows two principles: give a large data volume an appropriate number of maps, and give each map task an appropriate amount of data.
    2. The number of reduces
      1. How Hive itself determines the number of reduces:
      Amount of data processed by each reduce task:
      set hive.exec.reducers.bytes.per.reducer=524288000; (about 500 MB here; the default is 1 GB). With this setting, if the total size of the source files is over 9 GB, about 20 reducers are used.
      Maximum number of reducers per job:
      set hive.exec.reducers.max=999; (the default is 999)
      The number of reducers is N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer).
      a. Is more reducers always better? No. As with maps, starting and initializing reducers costs time and resources. In addition, each reducer produces one output file, so many reducers produce many small files, and if those small files become the input of the next task, the small-file problem appears again.
      2. Under what circumstances is there only one reduce? Quite often a job uses only one reduce task no matter how large the data is and regardless of how the reduce parameters are adjusted. Besides the input simply being smaller than the hive.exec.reducers.bytes.per.reducer value, the reasons are:
      (1) an aggregation without GROUP BY, for example writing select pt,count(1) from tab_info where pt = '2020-07-04' group by pt; as select count(1) from tab_info where pt = '2020-07-04';
      (2) ORDER BY is used;
      (3) there is a Cartesian product.
      Note: setting the number of reduces also follows two principles: give a large data volume an appropriate number of reduces, and give each reduce task an appropriate amount of data.

  11. Parallel execution
    set hive.exec.parallel=true; --enable parallel execution of stages
    set hive.exec.parallel.thread.number=16; --maximum parallelism allowed for one SQL statement, default 8
    Hive converts a query into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, and so on. By default Hive executes only one stage at a time. A particular job, however, may contain many stages, and they are not always fully dependent on each other, so some stages can run in parallel, which can shorten the total execution time. Note that on a shared cluster, the more stages a job runs in parallel, the higher the cluster utilization.
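    A sketch of a query whose stages are independent and can therefore benefit from parallel execution; the table names are made up:
    set hive.exec.parallel=true;
    -- the two subqueries do not depend on each other, so their stages can run concurrently
    select * from (select dt, count(1) c from app_log group by dt) a
    union all
    select * from (select dt, count(1) c from web_log group by dt) b;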

  12. Strict mode
    set hive.mapred.mode = strict; --enable strict mode
    set hive.mapred.mode = nonstrict; --enable non-strict mode
    Enabling strict mode prohibits three types of query:
    1) For a partitioned table, the WHERE clause must contain a partition column as a filter to limit the scanned range; otherwise the query is not allowed to run.
    2) A query that uses ORDER BY must also use LIMIT, because ORDER BY sends all result rows to a single reducer to perform the sort.
    3) Cartesian product queries are restricted.
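    For example (hypothetical partitioned table sale_detail with partition column dt), queries like these are rejected in strict mode:
    set hive.mapred.mode = strict;
    select * from sale_detail;                                                    -- rejected: no partition filter on dt
    select * from sale_detail where dt = '2020-07-04' order by amount;           -- rejected: order by without limit
    select * from sale_detail where dt = '2020-07-04' order by amount limit 100; -- allowed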
  13. JVM reuse
    JVM reuse lets JVM instances be reused N times within the same job. N can be configured in Hadoop's mapred-site.xml, or set manually:
    set mapred.job.reuse.jvm.num.tasks=10;

  14. Speculative execution
    set mapred.map.tasks.speculative.execution=true;
    set mapred.reduce.tasks.speculative.execution=true;
    set hive.mapred.reduce.tasks.speculative.execution=true;
    Symptom: while a program is running, one of its tasks drags on and cannot finish for a long time.
    Solution: with speculative execution enabled, when a task runs too slowly the framework starts a duplicate of the same task on another machine; whichever copy finishes first wins and the other is killed.
    Note: if a map or reduce task is long-running simply because its input is genuinely large, speculative execution only wastes resources, so the settings can be turned off in that case.

Source: blog.csdn.net/mitao666/article/details/110470913