Hive2 optimization parameters

While debugging Hive recently I tried out many Hive parameters to optimize tasks and reduce memory usage, and formed a few ideas of my own; this post is a record of them.

One: What is Hive

 Hive is a data warehouse for big data; it is a tool that converts SQL into MapReduce jobs.

Two: The basic MapReduce stages and the optimizations that can be done at each one

   (In fact many of these do not need to be set: they are either the defaults already, or they make execution take a long time once set. Treat this section as background for understanding; the settings actually worth using are in the next section.)

      Here is an old diagram from the web (old, but very representative, haha):

       

       As the diagram above shows, a MapReduce job breaks down into a handful of steps: input, splitting, mapping, shuffling, reducing, and final result. Every one of these stages can be optimized!

Splitting phase optimization: the input data is cut into blocks according to the split sizes.

    Optimization point 1: increase the split size appropriately. (Within a single node, files are split by maxsize and the leftovers are combined up to minsize; the node-level leftovers are then combined across nodes, and finally across racks.)

                     set mapreduce.input.fileinputformat.split.minsize = 1024000000;
                     set mapreduce.input.fileinputformat.split.maxsize = 1024000000; (default 256 MB)
                     set mapreduce.input.fileinputformat.split.minsize.per.node = 1024000000;
                     set mapreduce.input.fileinputformat.split.maxsize.per.node = 1024000000; (default 1 byte)
                     set mapreduce.input.fileinputformat.split.minsize.per.rack = 1024000000;
                     set mapreduce.input.fileinputformat.split.maxsize.per.rack = 1024000000; (default 1 byte)
                     set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; (already the default)
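
     As a rough illustration of what these settings achieve (the numbers are made up): a partition consisting of 2,000 small files totaling 10 GB would, with CombineHiveInputFormat and a split size of about 1 GB, be read by roughly 10 maps instead of 2,000.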

Mapping phase optimization: records are converted into key/value pairs here; in practice the number of maps is decided by the number of blocks cut in the splitting stage.

    Optimization point 1: limit how many map tasks run at the same time. (Example: the job has 100 maps, but they can be made to execute in batches of 10.)

      set mapreduce.job.running.map.limit=20;

            2. Limit the maximum memory a map can use.

      set mapreduce.map.memory.mb = 3584; (3.5 GB of memory; by default the task is killed once it exceeds 2.1 times this)

            3. Aggregate on the map side so that less data enters the shuffle, as sketched below.

      set hive.map.aggr = true;
      set hive.groupby.mapaggr.checkinterval = 100000; (aggregate once every 100,000 rows)
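
      A minimal query sketch for point 3 (the table ods_log and its columns are made up for illustration): with map-side aggregation on, each map pre-aggregates the counts for its own split, so far less data has to be shuffled to the reduces.

       set hive.map.aggr = true;
       set hive.groupby.mapaggr.checkinterval = 100000;
       -- each map emits partial per-user counts instead of one row per input record
       select user_id, count(*) as pv
       from ods_log
       where dt = '2019-09-20'
       group by user_id;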

Shuffling phase optimization: rows with the same key are sent to the same reduce; this stage is essentially a network-transfer process.

         Optimization point 1: compress the map output (it is decompressed automatically on the other side)

       set mapreduce.map.output.compress = true (map output compression; a MapReduce parameter)
       set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.GzipCodec (map output compression format; a MapReduce parameter)

    2. Merge files after the map phase

       set hive.merge.mapfiles = true (a new job is started to do the merging after the maps finish; hive.merge.size.per.task decides how large the merged files become)

Reducing phase optimization: the data is aggregated here and the result is written to the corresponding files.

     Optimization point 1: set the number of reduces by hand

       set mapred.reduce.tasks = 20;

             2. Limit how many reduce tasks run at the same time. (Example: the job has 100 reduces, but they can be made to execute in batches of 10.)

       set mapreduce.job.running.reduce.limit=80;

             3. Limit the maximum memory a reduce can use

       set mapreduce.reduce.memory.mb = 7168; (7 GB of memory; by default the task is killed once it exceeds 2.1 times this)

      4. Set how much data each reduce processes (this directly determines the number of reduces; a rough worked sketch follows this list)

       set hive.exec.reducers.bytes.per.reducer=1024000000;

      5. Maximum number of reduces

       set hive.exec.reducers.max = 2000; (a roundabout alternative to mapreduce.job.running.reduce.limit)

    6. Files can be merged after the reduce

       set hive.merge.sparkfiles = false (Spark engine: merge files after the job by starting a new task)
       set hive.merge.tezfiles = false (Tez engine: merge files after the job by starting a new task)
       set hive.merge.mapredfiles = true (MapReduce engine: merge files after the job by starting a new task)
       set hive.merge.smallfiles.avgsize = 100000000 (if the average size of the output files at the end of the job is below this value, they are merged)
       set hive.merge.size.per.task = 1024000000 (how large the merged files become)
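
       A rough sketch of how points 4 and 5 interact when the reduce count is not set by hand: Hive estimates the number of reduces as roughly min(ceil(bytes entering the reduce stage / hive.exec.reducers.bytes.per.reducer), hive.exec.reducers.max). With the values above, about 50 GB arriving at the reduce stage works out to around 50 reduces, far below the cap of 2000.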

Final-result stage optimization: this is essentially the process of writing the output files.

       Optimization point 1: compress the output written to HDFS after the reduce (each node runs its task on its own, but the final results all have to converge into one place)

       set mapreduce.output.fileoutputformat.compress=false // default is false; a reduce-side property
       set mapreduce.output.fileoutputformat.compress.type=BLOCK // default is RECORD; a reduce-side property
       set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec // default is org.apache.hadoop.io.compress.DefaultCodec
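
       A minimal sketch of writing a gzip-compressed final result (the tables dws_daily_pv and ods_log are made up for illustration); the Hive-level switch hive.exec.compress.output, covered under the job-to-job optimizations below, is normally turned on together with these codec settings:

       set hive.exec.compress.output=true;
       set mapreduce.output.fileoutputformat.compress=true;
       set mapreduce.output.fileoutputformat.compress.type=BLOCK;
       set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
       -- the files written for dws_daily_pv come out gzip-compressed
       insert overwrite table dws_daily_pv
       select dt, count(*) as pv from ods_log group by dt;

       One thing to weigh: plain gzip files are not splittable, so if another job has to scan this output in parallel, a splittable setup (for example compressed ORC or SequenceFile) can be the better choice.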

Other optimizations:

      JVM optimization: the number of tasks a single JVM runs is capped; we can raise the maximum it executes so JVMs get reused

       set mapreduce.job.jvm.numtasks=100

           Parallelism and thread-count optimization: a job's subqueries may be independent of each other, so parallel execution can be enabled

       set hive.exec.parallel = true; 
       set hive.exec.parallel.thread.number=8;
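
      A minimal sketch of a query that benefits from this (the tables ods_log_pc and ods_log_app are made up for illustration): the two sides of the UNION ALL do not depend on each other, so with parallel execution on, their jobs can run at the same time.

       set hive.exec.parallel = true;
       set hive.exec.parallel.thread.number=8;
       select 'pc'  as src, count(*) as pv from ods_log_pc  where dt = '2019-09-20'
       union all
       select 'app' as src, count(*) as pv from ods_log_app where dt = '2019-09-20';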

      Data skew optimization: the number of rows per key can be checked to decide whether data skew has occurred

       set hive.optimize.skewjoin=true;
       set hive.skewjoin.key=100000; (more than 100,000 rows with the same key is treated as data skew, and that key has to be spread out and processed separately)
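
      A minimal join sketch for these two settings (the tables and the hot key are made up for illustration): suppose most rows in ods_log carry user_id = '0' for anonymous visitors, so that single key would pile up on one reduce; with skew-join optimization on, Hive spots keys that exceed hive.skewjoin.key rows and handles them in a separate follow-up job.

       set hive.optimize.skewjoin=true;
       set hive.skewjoin.key=100000;
       select u.city, count(*) as pv
       from ods_log l
       join dim_user u on l.user_id = u.user_id
       group by u.city;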

         Partition optimization: Hive has bucketed tables and partitioned tables, and dynamic partitioning can be enabled (a partition is really just a separate directory)

       set hive.exec.dynamic.partition=true
       set hive.exec.dynamic.partition.mode=nonstrict (partitioned tables have a strict mode and a non-strict mode)
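
      A minimal sketch of loading with dynamic partitions (table and column names are made up for illustration): the partition value is taken from the last column of the SELECT, and Hive creates one directory per distinct dt on the fly.

       set hive.exec.dynamic.partition=true;
       set hive.exec.dynamic.partition.mode=nonstrict;
       create table if not exists dwd_log (user_id string, url string)
       partitioned by (dt string);
       -- every distinct dt in ods_log becomes its own partition directory under dwd_log
       insert overwrite table dwd_log partition (dt)
       select user_id, url, dt from ods_log;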

         Optimization between jobs:

       set hive.exec.compress.output=true; (compress the final result. If neither map compression nor reduce compression is used but this parameter is, then with two jobs the data after the first job is left uncompressed and the output of the second job is compressed)
       set hive.exec.compress.intermediate=true (if map compression, reduce compression and final-output compression are all unused but this parameter is, then with two jobs the data after the first job is compressed and the output of the second job is not)

    SQL optimization:

      Put small tables first in joins
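
      A minimal join sketch of this rule (table names are made up for illustration): list the small table first so it is the side buffered in memory, while the large table listed last is streamed through; a /*+ MAPJOIN(d) */ hint, or hive.auto.convert.join, would go further and broadcast the small table to the maps.

       select d.city, count(*) as pv
       from dim_city d            -- small dimension table, listed first
       join ods_log l             -- large fact table, listed last and streamed
         on l.city_id = d.city_id
       group by d.city;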

Three: Commonly used Hive parameter optimizations

   The section above actually optimizes every single stage, but many of those parameters are enabled by default or already have reasonable default values.

   Only a handful of commonly used ones are really needed; the rest are good background knowledge. The more common ones are listed below:

Splitting phase: merge small input files into large ones

                     set mapreduce.input.fileinputformat.split.minsize = 1024000000; (with mapreduce.map.memory.mb=3584 and the default kill at 2.1 times that, each map asks for 3.5 GB of memory anyway, so it might as well not go to waste)
                     set mapreduce.input.fileinputformat.split.maxsize = 1024000000;
                     set mapreduce.input.fileinputformat.split.minsize.per.node= 1024000000;
                     set mapreduce.input.fileinputformat.split.maxsize.per.node= 1024000000;
                     set mapreduce.input.fileinputformat.split.minsize.per.rack= 1024000000; 
                     set mapreduce.input.fileinputformat.split.maxsize.per.rack= 1024000000;

The map phase is usually fast, so its parameters can be left unset

Reduce phase

      set mapreduce.job.running.reduce.limit=80; (example: the job has 100 reduces, but they can be made to execute in batches of 10)

Merging files

      Hive merges files by starting a new job just for the merge; I feel this parameter is not really worth it, and the time is better spent simply writing the output directly (the same goes for both the map and reduce phases).

Compressing files

      (These parameters are excellent: compression not only saves space but also saves bandwidth during network transfer, and both MapReduce and Spark can decompress it by default, which is convenient.)

      set mapreduce.map.output.compress=true (map output compression; a map-phase parameter)
      set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec (map output compression format; a map-phase parameter)
      set mapreduce.output.fileoutputformat.compress=false // default is false; a reduce-phase parameter
      set mapreduce.output.fileoutputformat.compress.type=BLOCK // default is RECORD; a reduce-phase parameter
      set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec // default is org.apache.hadoop.io.compress.DefaultCodec
      set hive.exec.compress.output=true; (compress the final result. If neither map compression nor reduce compression is used but this parameter is, then with two jobs the data after the first job is left uncompressed and the output of the second job is compressed)
      set hive.exec.compress.intermediate=true (if map compression, reduce compression and final-output compression are all unused but this parameter is, then with two jobs the data after the first job is compressed and the output of the second job is not)

JVM optimization (recommended to leave unset)

Parallelism optimization

      set hive.exec.parallel = true;
      set hive.exec.parallel.thread.number=8;

Data skew optimization:

      set hive.optimize.skewjoin=true;
      set hive.skewjoin.key=100000; (more than 100,000 rows with the same key is treated as data skew, and that key has to be spread out and processed separately)

Partition optimization (the table has to be created with PARTITIONED BY):

      set hive.exec.dynamic.partition=true
      set hive.exec.dynamic.partition.mode=nonstrict (partitioned tables have a strict mode and a non-strict mode)

SQL optimization

      Put small tables first in joins (see the example in the previous section)

Four: How HiveSQL memory usage is calculated, how map/reduce counts are determined, and common UI ports

1. Hive parameter configuration:

      Found under the Hive UI -> Hive Configuration tab (parameter names vary between versions, so check the configured name carefully; preferably use the newest configuration names)

      

2. Memory resources used by HiveSQL

      The resources a HiveSQL job uses can be seen under RUNNING in the YARN management UI; refresh it now and then because the figures are dynamic. (Roughly 3.5 GB × number of maps + 7 GB × number of reduces with the settings above.)
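
      As a rough worked example with those figures: a job with 100 maps and 10 reduces would request on the order of 3.5 GB × 100 + 7 GB × 10 = 420 GB in total, although how much is in use at any moment depends on how many of those tasks actually run at the same time.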

3. The number of maps and reduces in the jobs generated by HiveSQL

      Find the job under FINISHED in the YARN management UI and click History; inside you can see the map count and the reduce count.

      The number of splitting blocks determines the number of maps, and the number of reduces depends on the size of the output (one reduce per 1 GB).
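
      For example (made-up numbers): a 100 GB input cut into roughly 1 GB splits gives about 100 maps, and if the output comes to about 20 GB the job ends up with about 20 reduces of 1 GB each.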

      

4. Common UI ports:

      1. HDFS web UI: 50070

      2. YARN management UI: 8088

      3. HistoryServer management UI: 19888

      4. ZooKeeper service port: 2181

      5. HiveServer2: 10002

      6. Kafka service port: 9092

      7. HBase web UI: 16010, 60010

      8. Spark web UI: 8080


Origin www.cnblogs.com/wuxiaolong4/p/11565220.html