Hive (Part 8): Performance Optimization

8. Performance optimization

8.1 Hive transaction

1. A transaction (Transaction) is a unitized set of operations that either all execute or none execute

ACID properties:

  • Atomicity: the operations in a transaction either all succeed or all fail
  • Consistency: a transaction moves data from one consistent state to another
  • Isolation: concurrent transactions do not interfere with each other
  • Durability: once committed, changes are permanent

2. Features and limitations of Hive transactions

  • Row-level transactions are supported starting with v0.14
    • Supports INSERT, DELETE, UPDATE (MERGE is supported since v2.2.0)
    • Only the ORC file format is supported
  • Limitations
    • The table must be bucketed
    • Consumes additional time, resources, and space
    • BEGIN, COMMIT, and ROLLBACK are not supported, nor are updates to bucket or partition columns
    • Locks can be shared or exclusive (serial rather than concurrent)
    • Reading and writing ACID tables from a non-ACID session is not allowed
    • Rarely used in practice

3. Enabling and configuring Hive transactions

  • Set via the Hive command line: valid only for the current session
  • Set via the configuration file: globally effective
  • Set via UI tools (such as Ambari)
-- Enable transactions from the command line
set hive.support.concurrency = true;
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1; 
-- Or set them globally in hive-site.xml
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
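
A minimal sketch of an ACID table that satisfies the requirements above (bucketed, ORC, transactional); the table and columns are illustrative, not from the original text:
-- Requires the transaction settings shown above
CREATE TABLE employee_acid (
  id     INT,
  name   STRING,
  salary DOUBLE
)
CLUSTERED BY (id) INTO 2 BUCKETS           -- transactional tables must be bucketed
STORED AS ORC                              -- only the ORC format is supported
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO employee_acid VALUES (1, 'Alice', 8000), (2, 'Bob', 9000);
UPDATE employee_acid SET salary = 9500 WHERE id = 2;   -- row-level update
DELETE FROM employee_acid WHERE id = 1;                -- row-level delete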


8.2 Hive HPL/SQL

  • HPL/SQL: Hive stored procedures (available after v2.0)
    • Also supports SparkSQL and Impala
    • Compatible with Oracle, DB2, MySQL, and T-SQL standards
    • Makes migrating existing procedural code to Hive simple and efficient
    • Allows writing UDF-like logic without Java skills
    • Performance is slightly slower than Java UDFs
    • A relatively new feature
  • Run ./hplsql from the Hive 2 bin directory
./hplsql -f plsql_demo.pl

CREATE FUNCTION hello(text STRING)
RETURNS STRING
BEGIN
  RETURN 'Hello, ' || text || '!';
END;
PRINT hello(' word');

CREATE PROCEDURE getCount()
BEGIN
  DECLARE cnt INT = 0;
  SELECT COUNT(*) INTO cnt FROM employee;
  PRINT 'Users cnt: ' || cnt;
END;
CALL getCount();

8.3 Hive performance tuning tools

1. EXPLAIN
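EXPLAIN displays a statement's execution plan without running it. A minimal usage sketch (the employee table from the ANALYZE examples below is assumed):
-- Show the stage plan of the query without executing it
EXPLAIN SELECT COUNT(*) FROM employee;
-- EXTENDED adds lower-level details such as file paths
EXPLAIN EXTENDED SELECT COUNT(*) FROM employee;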

2. ANALYZE

  • ANALYZE: analyzes table data, used as a reference for execution plan selection
    • Collects table statistics, such as row counts, maximum values, etc.
    • Hive uses this information to speed up queries
  • Syntax
ANALYZE TABLE employee COMPUTE STATISTICS; 

ANALYZE TABLE employee_partitioned 
PARTITION(year=2014, month=12) COMPUTE STATISTICS;

ANALYZE TABLE employee_id COMPUTE STATISTICS 
FOR COLUMNS employee_id;
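
One way to inspect the collected statistics (a sketch; the exact output fields vary by Hive version):
-- Table-level stats (numRows, rawDataSize, etc.) appear under Table Parameters
DESCRIBE FORMATTED employee;
-- Column-level stats gathered by the FOR COLUMNS variant (syntax may vary by version)
DESCRIBE FORMATTED employee_id employee_id;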

8.4 Hive design optimization

  • Use partition tables and bucket tables (see the DDL sketch after this list)
  • Use indexes
  • Use an appropriate file format, such as ORC, Avro, or Parquet
  • Use an appropriate compression format, such as Snappy
  • Consider data locality: add extra replicas of hot data
  • Avoid small files
  • Use the Tez engine instead of MapReduce
  • Use Hive LLAP (in-memory read caching)
  • Consider turning off concurrency when it is not needed
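
A minimal sketch combining several of these recommendations (partitioning, bucketing, ORC with Snappy compression); the table and columns are illustrative:
CREATE TABLE logs_orc (
  user_id BIGINT,
  action  STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)                  -- partition pruning on the date column
CLUSTERED BY (user_id) INTO 16 BUCKETS      -- bucketing helps joins and sampling
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');  -- columnar format plus lightweight compression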

8.5 Job optimization

1. Run in local mode

Hive supports automatically converting a job to run in local mode.
When the amount of data to process is small, the startup time of fully distributed mode exceeds the job's actual processing time.

-- Enable local mode with the following settings
SET hive.exec.mode.local.auto=true; --default false 
SET hive.exec.mode.local.auto.inputbytes.max=50000000; 
SET hive.exec.mode.local.auto.input.files.max=5; --default 4

  • A job must meet all of the following conditions to run in local mode:
    • The job's total input size is less than hive.exec.mode.local.auto.inputbytes.max
    • The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
    • The total number of reduce tasks is 1 or 0

2. JVM Reuse

  • Reduce JVM startup overhead through JVM reuse
    • By default, each map or reduce task starts a new JVM
    • When a map or reduce task runs for a very short time, JVM startup accounts for a large share of the overhead
    • With reuse, tasks share a JVM and run in it serially instead of each starting its own
    • Applies only to map or reduce tasks within the same job
    • Tasks from different jobs always run in separate JVMs
-- Enable JVM reuse with the following setting
set mapred.job.reuse.jvm.num.tasks = 5;  -- default is 1
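
Note: on Hadoop 2 / YARN the mapred.* property was renamed; the line below assumes the newer property name is recognized by your distribution:
-- Equivalent setting under the mapreduce.* namespace (verify against your Hadoop version)
set mapreduce.job.jvm.numtasks = 5;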

3. Parallel execution

  • Parallel execution can improve cluster utilization
    • Hive queries are usually converted into multiple stages that are executed sequentially by default
    • These stages are not always interdependent
    • They can be run in parallel to save overall job running time
    • If the utilization of the cluster is already high, parallel execution will not help much
-- Enable parallel execution with the following settings
SET hive.exec.parallel=true;  -- default false
SET hive.exec.parallel.thread.number=16;  -- default 8; maximum number of stages allowed to run in parallel

8.6 Query optimization

  • Enable map-side joins automatically (settings sketch below)
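A common way to let Hive convert suitable joins into map-side joins (a sketch; the threshold shown is illustrative and defaults vary by version):
set hive.auto.convert.join=true;                -- convert a join to a map join when one side is small enough
set hive.mapjoin.smalltable.filesize=25000000;  -- size threshold in bytes for the "small" table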
  • Prevent data skew
set hive.optimize.skewjoin=true;	
  • Enable CBO (Cost-Based Optimizer)
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
  • Enable vectorization
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
  • Use correct coding conventions, such as CTEs, temporary tables, and window functions (a small CTE sketch follows this list)
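
A minimal CTE sketch (the employee table from section 8.3 is assumed; the department column is illustrative):
-- Compute per-department counts once and reuse the result
WITH dept_cnt AS (
  SELECT department, COUNT(*) AS cnt
  FROM employee
  GROUP BY department
)
SELECT department, cnt
FROM dept_cnt
WHERE cnt > 10;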

8.7 Compression Algorithm

  • Reducing the amount of data transferred will greatly improve the performance of MapReduce
  • Using data compression is a good way to reduce the amount of data
  • Comparison of commonly used compression methods
Compression method | Splittable | Compressed size | Compression/decompression speed
gzip               | No         | Medium          | Medium
lzo                | Yes        | Large           | Fast
snappy             | No         | Large           | Fast
bzip2              | Yes        | Small           | Slow
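
A sketch of turning on compression for intermediate and final output (standard Hive/Hadoop property names; codec availability depends on the cluster):
-- Compress intermediate data shuffled between stages
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress the final job output
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;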
