Big Data Tool: Hive

Original link: https://mp.weixin.qq.com/s/ZhW4n7hldadiFjCGHsz88A

About Hive

Definition

Hive was developed by Facebook to handle the analysis of its massive log data and was later donated to the Apache Software Foundation as an open-source project. Hive is data warehousing software that uses SQL-like statements to help read, write, and manage big data sets stored on HDFS.

Hive Features

▪ Hive's biggest feature is that it lets you analyze large data sets with SQL-like statements, avoiding the need to write MapReduce programs in Java; this makes data analysis much easier.
▪ The data is stored on HDFS; Hive itself does not provide data storage.
▪ Hive maps data files into databases and tables through metadata; this metadata is generally stored in a relational database (such as MySQL).
▪ Data storage: Hive can store very large data sets and is not strict about data integrity or format.
▪ Data processing: not suitable for real-time computation and responses; used for offline analysis.

Hive Basic Syntax

Hive statements are somewhat similar to MySQL statements and are divided into DDL, DML, and DQL; I will not cover them in detail here. One point worth noting is that Hive tables are divided into internal tables, external tables, partitioned tables, and bucketed tables; interested readers can look up the relevant material.
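As a rough, hedged sketch (the table names, columns, and path below are invented purely for illustration), the DDL for the four kinds of tables looks roughly like this:

create table logs_internal (uid string, url string);        -- internal (managed) table
create external table logs_external (uid string, url string)
location '/data/logs_external';                              -- external table: data lives at the given HDFS path
create table logs_by_day (uid string, url string)
partitioned by (ds string);                                  -- partitioned table: one directory per ds value
create table logs_bucketed (uid string, url string)
clustered by (uid) into 32 buckets;                          -- bucketed table: rows hashed on uid into 32 buckets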

How Hive Works

Hive Architecture Diagram

Hive Core

The core of Hive is its driver engine, which consists of four parts:
▪ Interpreter: converts a HiveSQL statement into an abstract syntax tree (AST).
▪ Compiler: compiles the syntax tree into a logical execution plan.
▪ Optimizer: optimizes the logical execution plan.
▪ Executor: calls the underlying framework to run the logical execution plan.
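A quick way to look at what the driver produces is the explain statement, which prints the stages of the compiled plan; the table name below is the hypothetical one from the earlier sketch:

explain select count(*) from logs_by_day where ds = '2016-10-11';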
 
 
 

Hive Underlying Storage

Hive data is stored on HDFS; the databases and tables in Hive can be viewed as mappings onto data that lives on HDFS, so Hive must run on a Hadoop cluster.

Hive Program Execution

What the Hive executor ultimately runs are MapReduce programs, which are submitted to YARN for execution as a series of jobs.

Hive Metadata Storage

Hive metadata is generally stored in a relational database such as MySQL, and Hive interacts with this MySQL store through the MetaStore service.

Table 2: Hive metadata information

Metadata item     Description
Owner             Owner of the database or table
CreateTime        Creation time
LastAccessTime    Last modified time
Location          Storage location
Table Type        Table type (internal table or external table)
Table Field       Field information of the table

Hive Clients

Hive has many clients; a few are briefly listed below:
▪ CLI command-line client: an interactive command-line window for communicating with Hive.
▪ HiveServer2 client: communicates using the Thrift protocol. Thrift is a converter and connection protocol between programs written in different languages; it allows access to Hive through JDBC or ODBC (this is the connection method recommended on the Hive official website).
▪ HWI client: a web interface that ships with Hive, but it is rather crude and generally not used.
▪ HUE client: interacts with Hive through web pages; relatively easy to use and generally integrated in CDH.

Hive Tuning

Hadoop is like a huge ship: its throughput is large, but so is its startup cost, so if each task carries only a small amount of input and output, utilization will be low. The primary task in using Hadoop well is therefore to increase the amount of data each task handles. When optimizing Hive, read the Hive SQL as a MapReduce program rather than as SQL. Hive optimization can be approached at three levels: MapReduce-level optimization, Hive architecture-level optimization, and HiveQL-level optimization.

MapReduce-level Optimization

Set a reasonable number of map tasks

The earlier section described the MapReduce execution process: before the map function runs, the HDFS file is first divided into input splits (InputSplit), and the resulting splits become the input to the map function. The number of map tasks depends on the number of input splits; one input split corresponds to one map task. The input split size is determined by three parameters, as shown in Table 3:

Table 3: Parameters that determine the input split size

Parameter name              Default    Remark
dfs.block.size              128M       Size of an HDFS data block
mapreduce.min.split.size    0          Minimum split size
mapreduce.max.split.size    256M       Maximum split size

Formula: split size = max(mapreduce.min.split.size, min(dfs.block.size, mapreduce.max.split.size)). By default the split size equals dfs.block.size, i.e., one HDFS data block corresponds to one input split and therefore one map task. In that case a single map task processes one block of data on one machine, so data does not need to be transmitted over the network, which improves processing speed.
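For example, with the defaults from Table 3, split size = max(0, min(128M, 256M)) = 128M, so a 1 GB file yields 8 input splits and therefore 8 map tasks. Raising mapreduce.min.split.size to 256M would give max(256M, min(128M, 256M)) = 256M and roughly halve the number of map tasks (the exact property name to set varies with the Hadoop version, so treat it as something to verify on your cluster).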

Set a reasonable number of reduce tasks

The parameters that determine the number of reduce tasks are shown in Table 4:

Table 4: Parameters that determine the number of reduce tasks

Parameter name                          Default    Remark
hive.exec.reducers.bytes.per.reducer    1G         Amount of data handled by a single reducer
hive.exec.reducers.max                  999        Maximum number of reducers in Hive
mapred.reduce.tasks                     -1         Number of reduce tasks; -1 means Hive adjusts it automatically based on hive.exec.reducers.bytes.per.reducer

You can also set mapred.reduce.tasks manually to adjust the number of reduce tasks, as sketched below.
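A minimal sketch (the value 10 is arbitrary; -1 restores automatic estimation):

set mapred.reduce.tasks=10;
-- revert to automatic estimation based on hive.exec.reducers.bytes.per.reducer
set mapred.reduce.tasks=-1;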

Hive Architecture-level Optimization

Avoid executing MapReduce

Hive has two ways of reading data from HDFS: launching a MapReduce job, or fetching the data directly.

set hive.fetch.task.conversion=more

 

Setting the hive.fetch.task.conversion parameter to more enables the direct-fetch path for simple select, where, and limit queries, which can noticeably speed them up.

Executing MapReduce Locally

When a Hive query runs on a cluster, by default it runs across N machines, which requires multiple machines to coordinate; this approach handles large data volumes well. But when the amount of data processed by a Hive query is small, there is really no need to start in distributed mode, because distributed execution involves cross-network transfer and multi-node coordination and consumes resources. In such cases you can run the MapReduce job in local mode on a single machine, which is much faster.
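One common way to do this is to let Hive decide automatically when a job is small enough to run locally. hive.exec.mode.local.auto is a standard Hive parameter; the two thresholds below are assumptions to check against your version's defaults:

set hive.exec.mode.local.auto=true;
-- optional thresholds: maximum input bytes and file count for which local mode is chosen
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.input.files.max=4;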

JVM Reuse

A Hive statement is eventually converted into a series of MapReduce jobs, and each MapReduce job consists of a series of map tasks and reduce tasks. By default, each map task or reduce task starts its own JVM process, and the JVM exits once the task finishes. If tasks finish quickly and JVMs have to be started many times, JVM startup time becomes a significant cost; in that case the problem can be addressed by reusing JVMs.

set mapred.job.reuse.jvm.num.tasks=5

 

 

This setting tells a JVM process to exit only after it has run multiple tasks, which saves a lot of JVM startup time.

Parallel Execution

A Hive SQL statement may be converted into multiple MapReduce jobs; each job is a stage, and these jobs execute sequentially (this can also be seen in HUE's run logs). Sometimes, however, these jobs do not depend on each other. If cluster resources allow, the independent stages can be executed concurrently, which saves time and improves execution speed. But when cluster resources are scarce, enabling parallelism may instead cause the jobs to compete for resources and degrade overall performance.

Enable parallel execution:

set hive.exec.parallel=true
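A related knob caps how many independent stages may run concurrently (hive.exec.parallel.thread.number; the value 8 below is just an example):

set hive.exec.parallel.thread.number=8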

HiveQL-level Optimization

Optimizing with Partitioned Tables

A partitioned table stores data classified along one or more dimensions, with one partition corresponding to one directory. With this storage layout, if the filter conditions of a query include the partition field, Hive only needs to scan the files under the matching partition directories rather than all of the data, which greatly reduces the amount of data processed and improves query efficiency.
When queries against a Hive table mostly filter on one particular field, that table is a very good candidate for partitioning, as sketched below.
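A minimal sketch, assuming a day-partitioned table whose name and columns echo the trackinfo example used later in this article (the actual schema is an assumption):

create table trackinfo (url string, url_page_id string)
partitioned by (ds string);

-- only the files under the ds='2016-10-11' partition directory are scanned
select count(*) from trackinfo where ds = '2016-10-11';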

Optimizing with Bucketed Tables

The concept of a bucketed table was covered in detail in the first part of this tutorial: after you specify the number of buckets, each row is hashed on a chosen field at storage time to decide which bucket it is stored in. The goal is similar to partitioning: filtering no longer has to scan all of the data, only the relevant buckets. The settings below (used again for bucket map joins in the Join optimization section) are typically enabled for such tables:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;

 

Join Optimization

▪ Filter first, then join, to minimize the amount of data participating in the join.
▪ The small-table-joins-large-table principle
       You should follow the small-table-joins-large-table principle: in the reduce phase of a join, the contents of the table on the left side of the join are loaded into memory, so putting the table with fewer rows on the left effectively reduces the chance of out-of-memory errors. Jobs are generated from left to right in the join order, so in a chained query the tables should increase in size from left to right.
▪ Put joins with the same join-on condition into one job
       In Hive, when several tables are joined, joins whose join-on conditions are the same are merged into a single MapReduce job. You can exploit this by grouping joins with the same join-on condition into one job to save execution time, as in the following query:

select pt.page_id, count(t.url) PV
from rpt_page_type pt
join (
  select url_page_id, url
  from trackinfo
  where ds = '2016-10-11'
) t on pt.page_id = t.url_page_id
join (
  select page_id
  from rpt_page_kpi_new
  where ds = '2016-10-11'
) r on t.url_page_id = r.page_id
group by pt.page_id;

 

 

▪ Enable mapjoin
       A mapjoin distributes the smaller of the two joined tables directly into the memory of every map process and performs the join there, skipping the reduce step entirely and improving speed (see the sketch after this list).
▪ Bucketed-table mapjoin
       When two bucketed tables are joined on the bucketing field and the number of buckets of one table is a multiple of the other's, a map join can be enabled to improve efficiency. Bucketed-table mapjoin requires the hive.optimize.bucketmapjoin parameter to be enabled.
▪ Group By data skew optimization
       Group By easily causes data skew, because in real business workloads data is usually concentrated on a few keys (the familiar 80/20 rule). After grouping, some groups hold a very large amount of data while others hold very little. In MapReduce, all data for one group goes to the same reduce task, so some reducers are heavily loaded while others are nearly idle; this is data skew, and the whole job's run time is determined by the slowest reducer.
       The way to address this is to set a parameter: set hive.groupby.skewindata=true.
       When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same Group By key may end up on different reducers, which balances the load. The second MR job then distributes the pre-aggregated results to the reducers by Group By key (this guarantees that identical Group By keys land on the same reducer) and completes the final aggregation.
▪ Order By optimization
       Because order by can only run in a single reduce process, ordering a very large data set forces one reducer to handle an enormous amount of data and makes the query extremely slow.
▪ Read the source once, insert into multiple targets
▪ Convert join field types explicitly
▪ Use columnar storage formats such as ORC and Parquet
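As a hedged illustration of the mapjoin item above (hive.auto.convert.join and the MAPJOIN hint are standard Hive features, but the table names are made up):

set hive.auto.convert.join=true;  -- let Hive turn joins with a small enough table into mapjoins automatically
-- older Hive versions can request it explicitly with a hint:
select /*+ MAPJOIN(d) */ f.order_id, d.dim_name
from fact_orders f
join dim_small d on f.dim_id = d.dim_id;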

Origin www.cnblogs.com/shimingjie/p/11944955.html