Hive study notes (5)-Optimization 2

Seven JOIN optimization

Small/large table optimization

Small/large table optimization means that when two or more tables are joined, the tables in the query should be ordered so that their size increases from left to right. Hive can then keep the small tables in memory, perform the join on the map side, and match rows against the in-memory small tables one by one, thereby skipping the reduce phase that a normal join would require.

The first way is to rely on table position: write the small table first and the large table last. Here
dividends is the small table and stocks is the large table.

SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM dividends d JOIN stocks s ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = "AAPL";

The second way is to let Hive ignore the positions of the large and small tables and mark the large (streamed) table explicitly with the /*+STREAMTABLE(tablename)*/ hint; there is also a /*+MAPJOIN(tablename)*/ hint for marking the small table of a map join.

SELECT /*+STREAMTABLE(s)*/ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = "AAPL";          -- /*+STREAMTABLE(s)*/ marks which table is the large (streamed) table
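
The MAPJOIN hint mentioned above works the same way; a minimal sketch on the same tables marks the small dividends table so that it is loaded into memory and joined on the map side:

SELECT /*+MAPJOIN(d)*/ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = "AAPL";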

The third way is to set a parameter so that Hive converts eligible joins into map joins automatically, without any hint:

set hive.auto.convert.join=true

The fourth is to set the size threshold below which a table is treated as a small table (the unit is bytes):

hive.mapjoin.smalltable.filesize=25000000

These optimizations can be combined as appropriate for the situation.

Null or meaningless values
This situation is very common. For example, when the fact table is log data, some fields are often not recorded and are set to null, an empty string, -1, and so on, depending on the case. If there are many such missing values, these empty keys become very concentrated during the join (they all go to the same reducer), which slows progress.
Therefore, if the null data is not needed, filter it out with a WHERE clause in advance. If it must be kept, randomly scatter the null keys, for example by replacing records whose user ID is null with a random negative value:

select a.uid,a.event_type,b.nickname,b.age
from (
  select 
  (case when uid is null then cast(rand()*-10240 as int) else uid end) as uid,
  event_type from calendar_record_log
  where pt_date >= 20190201
) a left outer join (
  select uid,nickname,age from user_info where status = 4
) b on a.uid = b.uid;

Handling skewed keys separately
This is really an extension of the null-value handling above, except that the skewed keys are now meaningful. Generally there are very few skewed keys; we can sample them out, store the corresponding rows in a separate temporary table, prefix the keys with a small random number (such as 0-9), and finally aggregate the results. The SQL follows the same pattern as above; a rough sketch is given below.
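
A rough sketch of the salting idea, assuming the skewed keys have already been sampled into a hypothetical temporary table skewed_uids; calendar_record_log and user_info are the tables from the examples above:

-- step 1: rows whose uid is one of the sampled skewed keys go into a temp table
create table tmp_skewed as
select a.uid, a.event_type
from calendar_record_log a
left semi join skewed_uids k on a.uid = k.uid
where a.pt_date >= 20190201;

-- step 2: salt the skewed keys with a random 0-9 prefix on the log side and
-- replicate the dimension side once per prefix, so the hot keys are spread
-- over ten reducers instead of one
select a.uid, a.event_type, b.nickname
from (
  select uid, event_type,
         concat(cast(uid as string), '_', cast(cast(rand() * 10 as int) as string)) as salted_uid
  from tmp_skewed
) a
join (
  select d.uid, d.nickname,
         concat(cast(d.uid as string), '_', cast(x.salt as string)) as salted_uid
  from (
    select u.uid, u.nickname
    from user_info u
    left semi join skewed_uids k on u.uid = k.uid
  ) d
  lateral view explode(array(0,1,2,3,4,5,6,7,8,9)) x as salt
) b on a.salted_uid = b.salted_uid;

-- step 3: union this result with the ordinary join over the non-skewed rows,
-- then aggregate as usual (omitted)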

Different data types
This situation is not very common, and mainly occurs when a column with the same business meaning has undergone a change of type over time.
For example, suppose we have two calendar record tables, one old and one new. The record type field of the old table is (event_type int), while in the new table it is (event_type string). To stay compatible with old records, the new table also stores the old values as strings, such as '17'. When these two tables are joined, it often takes a very long time. The reason is that without an explicit conversion, the hash of the key is computed on the int type by default, so all of the "real" string keys end up on a single reducer. So pay attention to type conversion:

select a.uid,a.event_type,b.record_data
from calendar_record_log a
left outer join (
  select uid,event_type from calendar_record_log_2
  where pt_date = 20190228
) b on a.uid = b.uid and b.event_type = cast(a.event_type as string)
where a.pt_date = 20190228;

The build table is too large
Sometimes the build table is too large for a direct map join, for example a full user dimension table, while an ordinary join suffers from uneven data distribution. In that case, make full use of the probe table's filter conditions to first reduce the data volume of the build table, and then apply a map join. The cost is that two joins are required. For example:

select /*+mapjoin(b)*/ a.uid,a.event_type,b.status,b.extra_info
from calendar_record_log a
left outer join (
  select /*+mapjoin(s)*/ t.uid,t.status,t.extra_info
  from (select distinct uid from calendar_record_log where pt_date = 20190228) s
  inner join user_info t on s.uid = t.uid
) b on a.uid = b.uid
where a.pt_date = 20190228;

Eight MapReduce optimization

First, let's recall the MapReduce shuffle process.

Nine adjust the number of mappers

The number of mappers is closely tied to the number of splits of the input files. The split logic can be found in the Hadoop source class org.apache.hadoop.mapreduce.lib.input.FileInputFormat; no code is posted here, and how the mapper count is determined is described directly.

  • The expected value of the mapper number can be set directly through the parameter mapred.map.tasks (default value 2), but it may not take effect, as mentioned below.

  • Let the total size of the input files be total_input_size. In HDFS, the size of a block is specified by the parameter dfs.block.size, and the default value is 64MB or 128MB. By default, the number of mappers is:

    ```shell
    default_mapper_num = total_input_size / dfs.block.size
    ```

  • The parameters mapred.min.split.size (default value 1B) and mapred.max.split.size (default value 64MB) are used to specify the minimum and maximum split size, respectively. The rules for calculating the split size and the number of splits are:

    ```shell
    split_size = MAX(mapred.min.split.size, MIN(mapred.max.split.size, dfs.block.size))
    split_num = total_input_size / split_size
    ```

    This gives the number of mappers:

    ```shell
    mapper_num = MIN(split_num, MAX(default_mapper_num, mapred.map.tasks))
    ```


As you can see, to reduce the number of mappers you should increase mapred.min.split.size appropriately, which lowers the split count; to increase the number of mappers, besides reducing mapred.min.split.size, you can also raise mapred.map.tasks.
Generally speaking, if the input is a small number of large files, reduce the number of mappers; if the input is a large number of files that are not small, increase the number of mappers; for a large number of small files, handle them with the approach in the "Merge small files" section below.
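
As a hedged illustration (the values are arbitrary, not recommendations), the two directions would be set per session like this:

-- fewer mappers over a handful of large files: raise the minimum split size (here 256MB)
set mapred.min.split.size=268435456;
-- or, more mappers: lower the maximum split size and/or raise the expected task count
set mapred.max.split.size=67108864;
set mapred.map.tasks=100;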

Ten adjust the number of reducers

The method for determining the number of reducers is much simpler than for mappers. The parameter mapred.reduce.tasks sets the number of reducers directly, and unlike the mapper parameter it is not merely an expected value. If this parameter is not set, Hive guesses the number itself with the following logic:

  • The parameter hive.exec.reducers.bytes.per.reducer is used to set the maximum amount of data that each reducer can handle. The default value is 1G (before version 1.2) or 256M (after version 1.2).

  • The parameter hive.exec.reducers.max is used to set the maximum number of reducers for each job. The default value is 999 (before version 1.2) or 1009 (after version 1.2).

  • Get the number of reducers:

    ```shell
    reducer_num = MIN(total_input_size / reducers.bytes.per.reducer, reducers.max)
    ```

The number of reducers is related to the number of output files. If there are too many reducers, a large number of small files will be generated, which will put pressure on HDFS. If the number of reducers is too small, each reducer has to process a lot of data, which may slow down the running time or cause OOM.
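
A hedged example of the two ways to control the reducer count (the values are only illustrative):

-- set the number of reducers explicitly
set mapred.reduce.tasks=20;
-- or leave it unset and tighten Hive's own estimate instead
set hive.exec.reducers.bytes.per.reducer=536870912;  -- 512MB per reducer
set hive.exec.reducers.max=200;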

Eleven merge small files

Merging in the input stage
The input file format of Hive needs to be changed, that is, the parameter hive.input.format. The default value is org.apache.hadoop.hive.ql.io.HiveInputFormat; change it to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
Compared with adjusting the number of mappers above, there are two additional parameters, mapred.min.split.size.per.node and mapred.min.split.size.per.rack, meaning the minimum split size on a single node and on a single rack, respectively. If a split smaller than these two values (the default is 100MB) is found, it will be merged. For the specific logic, refer to the corresponding class in the Hive source code.
Merging in the output stage
Simply set both hive.merge.mapfiles and hive.merge.mapredfiles to true. The former merges the output of map-only tasks, and the latter merges the output of map-reduce tasks.
In addition, hive.merge.size.per.task specifies the expected size of the merged files output by each task, and hive.merge.size.smallfiles.avgsize specifies the average-size threshold for all output files; the default value is 1GB. If the average output file size falls short of the threshold, an extra task is started to do the merging.
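
A hedged sketch of the settings described above (the size values are illustrative):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=268435456;            -- desired size of merged output files
set hive.merge.size.smallfiles.avgsize=134217728;  -- merge when the average output file is below this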

Twelve enable compression

Compressing the job's intermediate and output data can save a lot of space at the cost of a small amount of CPU time. Snappy is generally chosen as the compression codec because it is the most efficient.
To enable intermediate compression, set hive.exec.compress.intermediate to true and specify the compression codec through hive.intermediate.compression.codec, for example org.apache.hadoop.io.compress.SnappyCodec. In addition, the parameter hive.intermediate.compression.type can select block (BLOCK) or record (RECORD) compression; BLOCK achieves a higher compression ratio.
The configuration of output compression is basically the same: just turn on hive.exec.compress.output.
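
For example, the corresponding session settings look like this:

set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
set hive.exec.compress.output=true;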

Sixteen JVM reuse

In an MR job, a new JVM is started for each task by default. If the tasks are small and numerous, JVM startup and shutdown take up a considerable amount of time. This can be alleviated with the parameter mapred.job.reuse.jvm.num.tasks: setting it to 5, for example, means that 5 sequentially executed tasks within the same MR job can reuse one JVM, reducing startup and shutdown overhead. It has no effect across tasks of different MR jobs.
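
For example, to let 5 sequential tasks share one JVM:

set mapred.job.reuse.jvm.num.tasks=5;
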
Seventeen parallel execution and local mode

Parallel execution

Jobs in Hive that do not depend on each other can be executed in parallel; the most typical case is multiple subqueries combined with UNION ALL. When cluster resources are relatively sufficient, parallel execution can be enabled by setting the parameter hive.exec.parallel to true. In addition, hive.exec.parallel.thread.number sets the number of threads for parallel execution; the default of 8 is generally sufficient.
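
For example:

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
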
Local mode
Hive can also process tasks directly on a single node instead of submitting them to the cluster. Because the overhead of submitting to the cluster is avoided, this is well suited to tasks with small data volumes and uncomplicated logic.
Set hive.exec.mode.local.auto to true to enable local mode. However, a task is executed in local mode only if its total input data is smaller than hive.exec.mode.local.auto.inputbytes.max (default value 128MB), its number of mappers is smaller than hive.exec.mode.local.auto.tasks.max (default value 4), and its number of reducers is 0 or 1.
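
For example:

set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128MB
set hive.exec.mode.local.auto.tasks.max=4;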

Eighteen strict mode

Strict mode forcibly forbids users from executing three kinds of risky HiveQL statements; once executed, they fail directly. The three kinds of statements are:

  • queries on a partitioned table that do not restrict the partition column;
  • join statements between two tables that produce a Cartesian product;
  • statements that use ORDER BY to sort but do not specify a LIMIT.

To enable strict mode, set the parameter hive.mapred.mode to strict.
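
For example, with strict mode on, an ORDER BY without a LIMIT is rejected:

set hive.mapred.mode=strict;
-- the following would now fail because ORDER BY has no LIMIT:
-- select uid from calendar_record_log where pt_date = 20190228 order by uid;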

Nineteen Adopt a suitable storage format

In a HiveQL CREATE TABLE statement, the storage format of the table can be specified with STORED AS .... The storage formats supported by Hive tables include TextFile, SequenceFile, RCFile, Avro, ORC, Parquet, and so on.

The storage format generally needs to be chosen according to the business. In our practice, most tables use one of two storage formats: TextFile or Parquet.
TextFile is the simplest storage format: it stores plain text records and is also Hive's default format. Although its disk overhead is relatively large and its query efficiency is low, it is mostly used as a staging format, because tables in RCFile, ORC, Parquet, and other formats cannot be loaded directly from plain files and the data has to pass through a TextFile table first.
Parquet and ORC are both open-source columnar storage formats under Apache. Columnar storage is better suited to batch OLAP queries than traditional row storage, and it also supports better compression and encoding. We chose Parquet mainly because it is supported by the Impala query engine, and our demand for update, delete, and transactional operations is low.
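
A hedged sketch of the TextFile-to-Parquet staging pattern described above (the table definitions and the HDFS path are made up for illustration):

-- staging table: plain text, loadable directly from files
create table calendar_record_log_text (
  uid bigint,
  event_type string
)
row format delimited fields terminated by '\t'
stored as textfile;

load data inpath '/data/calendar_record_log/20190228'
into table calendar_record_log_text;

-- final table stored as Parquet, populated from the staging table
create table calendar_record_log_parquet (
  uid bigint,
  event_type string
)
stored as parquet;

insert overwrite table calendar_record_log_parquet
select uid, event_type from calendar_record_log_text;
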
I won't expand on their details here; you can refer to their respective official websites:
https://parquet.apache.org/
https://orc.apache.org/

reference:

  1. "Hive Programming Guide"
    link: https://pan.baidu.com/s/15SXPvo9DIed_OuDmdwTjow
    Extraction code: wauz
  2. "Meituan Development Document"
  3. https://www.jianshu.com/p/deb4a6f91d3b

Origin: blog.csdn.net/u013963379/article/details/90724333