Simple Hive optimizations and workflow debugging

1. Define the job name

SET mapred.job.name='customer_rfm_analysis_L1';

This makes it easy to spot your job in the task list at a glance.

 

2. Avoid DISTINCT; use GROUP BY where possible

Because DISTINCT funnels all the data into a single reducer, it causes data skew. This applies when the number of distinct values is greater than 1000.
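A common rewrite (a sketch; orders and customer_id are hypothetical names) replaces COUNT(DISTINCT ...) with a GROUP BY subquery so the de-duplication is spread across reducers:

-- Skewed: every row is funneled through a single reducer to de-duplicate
SELECT COUNT(DISTINCT customer_id) FROM orders;

-- Better: GROUP BY distributes the de-duplication across reducers
SELECT COUNT(*)
FROM (
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
) t;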

 

3. Put the smaller table on the left side of a JOIN

Otherwise the join will consume a lot of disk and memory: in each map/reduce join stage, Hive buffers the earlier tables in memory and streams only the last one.
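A minimal sketch (big_fact and small_dim are hypothetical names), listing the small table first so the large one is streamed:

-- small_dim is buffered in memory; big_fact, listed last, is streamed
SELECT b.order_id, s.region_name
FROM small_dim s
JOIN big_fact b
  ON b.region_id = s.region_id;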

4. For a join between a small table and a large table, use a map join

The small table can simply be loaded into memory, which saves repeated reads and writes.
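Two ways to trigger a map join (the table names are hypothetical; the settings are standard Hive properties):

-- Explicit hint (older Hive versions)
SELECT /*+ MAPJOIN(s) */ b.order_id, s.region_name
FROM big_fact b
JOIN small_dim s ON b.region_id = s.region_id;

-- Or let Hive convert eligible joins automatically
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- small-table size threshold, in bytes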

5. If a UNION ALL has more than 2 branches

Or each branch carries a large volume of data, split the query into multiple INSERT INTO statements, as sketched below.
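A sketch of the rewrite (target, a, b, and c are hypothetical tables):

-- Instead of one statement:
--   INSERT OVERWRITE TABLE target
--   SELECT * FROM a UNION ALL SELECT * FROM b UNION ALL SELECT * FROM c;
-- split it into separate inserts:
INSERT OVERWRITE TABLE target SELECT * FROM a;
INSERT INTO TABLE target SELECT * FROM b;
INSERT INTO TABLE target SELECT * FROM c;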

6. Common settings for every SQL script

-- The same settings go at the top of every SQL script
-- Input split size and compressed output
SET mapred.max.split.size=256000000;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
-- Dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Compress intermediate map output
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Shuffle tuning
SET mapreduce.reduce.shuffle.input.buffer.percent=0.6;
SET mapreduce.reduce.shuffle.parallelcopies=5;
-- Limits for created files and dynamic partitions
SET hive.exec.max.created.files=655350;
SET hive.exec.max.dynamic.partitions=10000000;
SET hive.exec.max.dynamic.partitions.pernode=10000000;

7. Workflow

1) Getting dates dynamically in the coordinator

       ${coord:formatTime(coord:dateOffset(coord:nominalTime(),-2,'DAY'), 'yyyy-MM-dd')}

Returns the date two days before the nominal run time (format yyyy-MM-dd).

       ${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}

Returns the nominal run date, i.e. today's date (format yyyy-MM-dd).

Note: after modifying a workflow, you must re-submit the coordinator for the change to take effect.

2) Coordinator time settings

Choose the UTC time zone;

Set the scheduled time to the desired execution time minus 8 hours (local time here is UTC+8, so for example a 02:00 local run is scheduled at 18:00 UTC the previous day);

3) The coordinator automatically detects the parameters of the workflow it calls, and you can assign values to them there.

    In SQL, reference a parameter as ${parameter_name}; if the value is a string, quote it: '${parameter_name}'.

    In the workflow, set parameters as ${parameter_name}, without quotation marks.
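A minimal sketch of passing a date parameter into a Hive script (run_date, daily_report, and orders are hypothetical names):

-- run_date is supplied by the coordinator/workflow
INSERT OVERWRITE TABLE daily_report PARTITION (dt='${run_date}')
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
WHERE dt = '${run_date}'
GROUP BY customer_id;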

4) Workflow parameters can also be supplied by uploading a file. What format does this file need to be? (Open question.)

5) The job dies partway through with status FAILED:

Halting due to Out Of Memory Error...

GC overhead limit exceeded

Fix to try: increase the Oozie launcher heap size (for example via the oozie.launcher.* memory properties in the action configuration). This solved it.

6) A sub-workflow running concurrently with others can fail to execute yet still show status Succeeded. The task does not actually complete, because an adjacent sub-workflow hit an error:

GC overhead limit exceeded
Closing: 0: jdbc:hive2://spark-02:10000/default
Intercepting System.exit(2)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]

8. Oozie can call HiveQL, Spark, Java, and Shell actions.

9. Bundles are configured in the same settings area as workflows and coordinators.

A bundle can package multiple coordinators together.

 
