Big Data Interview Series --Hive

Hive is a data warehouse tool built on top of Hadoop for processing structured data.

1. Differences between Hive and a traditional database
1. Data storage location: Hive is built on top of Hadoop, and all Hive data is stored in HDFS. A database stores its data on block devices or in the local file system.
2. Data format: Hive does not define a specific data format; the user specifies it through three properties: the column delimiter, the row delimiter, and the method for reading the file data. A database's storage engine defines its own data format, and all data is stored according to that organization.
3. Data updates: Hive content is written once and read many times, so Hive does not support rewriting or deleting rows; the data is fixed at load time. Data in a database usually needs to be modified frequently.
4. Execution latency: when Hive queries data it has to scan the whole table (or partition), so latency is high; Hive only has an advantage when processing large data sets. A database has low latency when processing small amounts of data.
5. Indexes: Hive has none; a database does.
6. Execution engine: Hive uses MapReduce; a database uses its own executor.
7. Scalability: Hive is high; a database is low.
8. Data scale: Hive handles large data sets; a database handles small ones.

2. Differences between Hive internal and external tables
When an internal table is dropped, Hive deletes both the table's metadata and its data; when an external table is dropped, Hive deletes only the metadata, and the data itself is not deleted.
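A minimal HiveQL sketch of the two table types (the table names, columns and path are hypothetical):

-- internal (managed) table: the data lives under Hive's warehouse directory
create table logs_managed (id int, msg string)
row format delimited fields terminated by '\t';

-- external table: Hive only records the location, the files stay where they are
create external table logs_external (id int, msg string)
row format delimited fields terminated by '\t'
location '/data/logs';

-- drop table logs_managed;   -- removes the metadata AND the data files
-- drop table logs_external;  -- removes the metadata only; /data/logs is untouched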

3. Differences between Hive partitioned tables and bucketed tables
Partitioning:
Hive partitioning is implemented with HDFS subdirectories; each subdirectory name contains a partition column name and the corresponding value.
Because Hive is really an abstraction over data stored on HDFS, a Hive partition name corresponds to a directory name (a sub-partition to a subdirectory name); it is not an actual field in the data files.
Bucketing:
A bucketed table organizes data further within a table or a partition. Hive buckets rows by the value of the bucketing column: the value is hashed, and the hash result modulo the number of buckets decides the bucket. This ensures that every bucket has data, but the number of rows in each bucket is not necessarily equal.

There are three major differences (see the sketch after this list):
1. A partition corresponds to a directory on HDFS; a bucket corresponds to a file on HDFS.
2. Bucketing splits the data randomly, while partitioning splits it non-randomly. Because buckets are assigned by a hash function, the split is relatively even; splitting by the values of a partition column can easily cause data skew.
3. A bucket corresponds to a different file (fine-grained), a partition to a different folder (coarse-grained). Bucketing divides the data at a finer granularity, so queries over partitioned and bucketed tables are processed more efficiently and sampling becomes more efficient.
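A short HiveQL sketch contrasting the two (table and column names are hypothetical):

-- partitioned table: each dt value becomes a subdirectory such as .../dt=2020-02-01/
create table visits_p (user_id string, url string)
partitioned by (dt string);

-- bucketed table: rows are hashed on user_id into 4 files
create table visits_b (user_id string, url string)
clustered by (user_id) into 4 buckets;

-- efficient sampling on a bucketed table
select * from visits_b tablesample (bucket 1 out of 4 on user_id);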

4. Hive metadata (metastore) storage
Hive supports three different metastore deployments: the embedded metastore, the local metastore, and the remote metastore; each uses different configuration parameters.
1. The embedded metastore is mainly used for unit tests; in this mode only one process can connect to the metastore at a time. Derby is the default database for the embedded metastore.
2. In local mode, each Hive client opens a connection to the metastore database and issues SQL queries over that connection.
3. In remote mode, all Hive clients open a connection to the metastore server, and the server in turn queries the metastore database; the Thrift protocol is used for communication between the clients and the metastore server.
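A hedged hive-site.xml sketch for the remote mode; the host names and the MySQL backing store are assumptions, only the property names come from Hive itself:

<!-- client side: point Hive clients at the metastore service -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
<!-- metastore server side: JDBC connection to the backing database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db-host:3306/hive_metastore</value>
</property>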

5. The difference between order by, sort by, distribute by and cluster by in Hive
1. order by: order by sorts the input globally, so there is only one reducer (multiple reducers cannot guarantee a global order); with a single reducer, a large input leads to a long computation time.
2. sort by: sort by does not sort globally; the sorting is finished before the data leaves each reducer. sort by guarantees that the output of each reducer is ordered, not that the overall output is ordered; it can only guarantee ordering by the specified field within the same reducer.
3. distribute by: distribute by controls how the map-side data is split among the reducers. Hive distributes rows to the reducers according to the columns after distribute by, using a hash algorithm by default. sort by produces one sorted file per reducer. In some cases you need to control which reducer a particular row goes to, usually for a subsequent aggregation; distribute by does exactly that, so distribute by and sort by are often used together.
4. cluster by: cluster by has the function of sort by in addition to that of distribute by, but the sort can only be ascending; you cannot specify ASC or DESC.
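A small sketch of the four clauses on a hypothetical orders table:

-- global order, single reducer
select * from orders order by amount desc limit 100;

-- each reducer's output is ordered, the overall result is not
select * from orders sort by amount desc;

-- rows with the same user_id go to the same reducer, then are sorted within it
select * from orders distribute by user_id sort by user_id, amount;

-- shorthand for distribute by + sort by on the same column (ascending only)
select * from orders cluster by user_id;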

6. Talking about Hive optimization
1. Fetch task conversion: for some queries, Hive does not need to run MapReduce at all. The property hive.fetch.task.conversion defaults to more in hive-default.xml.template (older Hive versions defaulted to minimal; the default was later changed to more). With more, global lookups, single-column lookups, and limit queries do not launch MapReduce.
2. When joining a large table with a small table, put the small table on the left; this can effectively avoid OOM.
3. MapJoin
If MapJoin is not specified or its conditions are not met, the Hive parser converts the join into a Common Join, i.e. the join is completed in the Reduce stage, where data skew easily occurs. With MapJoin, the small table can be loaded entirely into memory and the join is done on the map side, avoiding the reduce processing (see the sketch below).
Parameters to enable MapJoin:
(1) Enable automatic MapJoin conversion (defaults to true):
set hive.auto.convert.join = true;
(2) Threshold below which a table is considered small (default 25M):
set hive.mapjoin.smalltable.filesize = 25000000;
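With the parameters above, a join like the following (table names hypothetical) is converted to a map join automatically; the /*+ mapjoin */ hint shows the older explicit form:

-- dim_city is below hive.mapjoin.smalltable.filesize, so it is broadcast to every mapper
select /*+ mapjoin(c) */ o.order_id, c.city_name
from orders o
join dim_city c on o.city_id = c.city_id;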
4. Group by
By default, the map stage distributes rows with the same key to one reducer; when a single key has too much data, the job is skewed.
Not all aggregations have to be done on the reduce side; many aggregations can first be partially aggregated on the map side, with the final result produced on the reduce side (a sketch follows below).
Parameters to enable map-side aggregation:
(1) Whether to aggregate on the map side (defaults to true):
set hive.map.aggr = true;
(2) Number of entries at which the map side performs aggregation:
set hive.groupby.mapaggr.checkinterval = 100000;
(3) Apply load balancing when the data is skewed (defaults to false):
set hive.groupby.skewindata = true;
When this option is set to true, the resulting query plan has two MR jobs. In the first MR job, the map output is randomly distributed among the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same Group By key may end up in different reducers, which achieves load balancing. The second MR job then distributes the pre-aggregated results to reducers by the Group By key (this guarantees that the same Group By key goes to the same reducer) and finally completes the aggregation.
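Putting the settings together, a hedged sketch of a skew-prone aggregation (table and column names hypothetical):

set hive.map.aggr = true;            -- partial aggregation on the map side
set hive.groupby.skewindata = true;  -- plan the query as the two MR jobs described above

select user_id, count(1) as pv
from page_views
group by user_id;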
5. Count(distinct) optimization
When the amount of data is small this needs no optimization, but when the amount of data is very large, COUNT DISTINCT is completed by a single Reduce task; that one reducer then has too much data to process, which can make the whole job hard to finish. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by a COUNT, as sketched below.
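A sketch of the rewrite on a hypothetical visits table:

-- skew-prone: the distinct is finished by a single reducer
select count(distinct user_id) from visits;

-- preferred for very large data: deduplicate with group by first, then count
select count(1) from (select user_id from visits group by user_id) t;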
6. Filter columns and rows
Column filtering: select only the columns that are needed; do not use select *.
Row filtering: when two tables are joined and a filter condition is needed, apply the where condition before the join (push it into a subquery), as sketched below.
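A sketch of row filtering before the join (table names hypothetical):

-- less efficient: the filter is applied after the full join
select o.id from orders o join users u on o.uid = u.id where u.id <= 10;

-- better: filter the rows first, then join the smaller result
select o.id from orders o join (select id from users where id <= 10) u on o.uid = u.id;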
7. Parallel execution
Hive converts a query into one or more stages. A stage can be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or any other stage Hive may need during execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages, and these stages may not be fully dependent on each other; that is, some stages can run in parallel, which can shorten the overall execution time of the job. The more stages that can run in parallel, the faster the job may finish.
Setting the parameter hive.exec.parallel to true enables concurrent execution. On a shared cluster, keep in mind that if the number of parallel stages in a job increases, cluster utilization will increase as well.

set hive.exec.parallel=true;              -- enable parallel execution of stages
set hive.exec.parallel.thread.number=16;  -- maximum degree of parallelism for one SQL query; default is 8

8. Strict mode
Hive provides a strict mode that prevents users from executing queries that might have unintended, harmful effects.
The default value of the property hive.mapred.mode is nonstrict (non-strict mode). Turning strict mode on requires changing hive.mapred.mode to strict; with strict mode enabled, three types of queries are disabled (see the sketch below).
1) For partitioned tables, a query is not allowed unless the where clause contains a filter on the partition column that limits the scope; in other words, users are not allowed to scan all partitions. The reason for this restriction is that partitioned tables usually hold very large, rapidly growing data sets; a query with no partition limit could consume an unacceptably huge amount of resources.
2) Queries that use order by must also use a limit clause. Because order by sends all of the result data to a single reducer to perform the sort, forcing users to add LIMIT prevents that reducer from running for an excessively long time.
3) Queries that produce a Cartesian product are restricted. Users who know relational databases well might expect to run a join without an ON clause and put the join condition in the WHERE clause instead, because a relational database optimizer can efficiently convert the WHERE clause into an ON condition. Unfortunately, Hive does not perform this optimization, so if the tables are large enough the query gets out of control.
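A short sketch of what strict mode enforces (the table and partition are hypothetical):

set hive.mapred.mode = strict;

-- rejected in strict mode: no partition filter on a partitioned table
-- select * from visits_p;

-- allowed: partition filter, and order by accompanied by limit
select * from visits_p where dt = '2020-02-01' order by user_id limit 10;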
9. SQL optimization

1. Remove columns that the query does not need.
2. Evaluate where conditions and similar filters as early as the TableScan stage.
3. Use partition information to read only the partitions that match the condition.
4. Map-side join: use the large table as the driving table and load the small table entirely into every mapper's memory.
5. Adjust the join order to make sure the large table is the driving table.
6. For group by on tables with unevenly distributed data, split the work into two map-reduce stages to avoid the data concentrating on a few reducers:
the first stage shuffles by the distinct column and partially aggregates on the reduce side to shrink the data;
the second map-reduce stage then aggregates by the group-by columns (a sketch follows).
7. Use hash-based partial aggregation on the map side to reduce the amount of data processed on the reduce side.
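A sketch of item 6, rewriting a skewed per-group count(distinct ...) as two stages (table and columns hypothetical):

-- one stage, skew-prone: each dt builds its distinct set on a single reducer
select dt, count(distinct user_id) from visits group by dt;

-- two stages: deduplicate on (dt, user_id) first, then count per dt
select dt, count(1)
from (select dt, user_id from visits group by dt, user_id) t
group by dt;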
Origin blog.csdn.net/I_Demo/article/details/104277223