1. Set the job name
SET mapred.job.name='customer_rfm_analysis_L1';
This makes it easy to spot your job in the task list at a glance.
2. Use DISTINCT sparingly; prefer GROUP BY where possible
DISTINCT funnels the data through a single reducer, which causes data skew, especially when the number of distinct values exceeds 1000.
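As an illustration (table and column names are hypothetical), a COUNT(DISTINCT) can be rewritten as a GROUP BY subquery so the de-duplication work is spread across many reducers:

```sql
-- Skew-prone: the global de-duplication runs in a single reducer
SELECT COUNT(DISTINCT user_id) FROM dw.orders;

-- Skew-friendly: de-duplicate with GROUP BY first, then count
SELECT COUNT(*) FROM (
  SELECT user_id
  FROM dw.orders
  GROUP BY user_id
) t;
```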
3. In a join, put the small table on the left
Hive buffers the left-hand table in memory while streaming the right-hand one, so putting the large table on the left consumes a lot of disk and memory.
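A quick sketch with hypothetical table names — the small dimension table goes on the left and the large fact table, which is streamed, on the right:

```sql
-- dim_city (small) is buffered; fact_orders (large) is streamed
SELECT c.city_name, o.order_id
FROM dim_city c
JOIN fact_orders o
  ON o.city_id = c.city_id;
```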
4. For a join between a small table and a large table, use a map join
The small table can be loaded entirely into memory, avoiding the shuffle and its extra read/write passes.
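Two ways to get a map join (table names are hypothetical; the size threshold is an example value): let Hive convert it automatically, or force it with the MAPJOIN hint.

```sql
-- Automatic conversion: tables under the threshold are map-joined
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- threshold in bytes

-- Explicit hint: load dim_city into memory on every mapper
SELECT /*+ MAPJOIN(c) */ c.city_name, o.order_id
FROM fact_orders o
JOIN dim_city c ON o.city_id = c.city_id;
```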
5. If a UNION ALL has more than two branches,
or each branch carries a large amount of data, split it into multiple INSERT INTO statements.
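A minimal sketch with hypothetical tables: the three-way UNION ALL is replaced by three separate loads into the same target table.

```sql
-- Instead of one job over a wide UNION ALL ...
-- INSERT OVERWRITE TABLE dw.all_events
-- SELECT * FROM src_a UNION ALL SELECT * FROM src_b UNION ALL SELECT * FROM src_c;

-- ... load each branch with its own statement
INSERT INTO TABLE dw.all_events SELECT * FROM src_a;
INSERT INTO TABLE dw.all_events SELECT * FROM src_b;
INSERT INTO TABLE dw.all_events SELECT * FROM src_c;
```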
6. Common SQL settings
-- The same settings go at the top of every SQL script
SET mapred.max.split.size=256000000;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET mapreduce.map.output.compress=true;
SET mapred.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.reduce.shuffle.input.buffer.percent=0.6;
SET mapreduce.reduce.shuffle.parallelcopies=5;
SET hive.exec.max.created.files=655350;
SET hive.exec.max.dynamic.partitions=10000000;
SET hive.exec.max.dynamic.partitions.pernode=10000000;
7. Workflow
1) Coordinator dynamic date acquisition
${coord:formatTime(coord:dateOffset(coord:nominalTime(),-2,'DAY'), 'yyyy-MM-dd')}
Returns the date two days before the nominal run date, formatted as yyyy-MM-dd.
${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}
Returns the nominal run date itself, formatted as yyyy-MM-dd.
Note: after modifying a workflow, the coordinator must be re-submitted for the change to take effect.
2) Coordinator time settings
Choose the UTC time zone;
set the start time to the desired local execution time minus 8 hours (to offset UTC+8).
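For example (app name, frequency, and dates are hypothetical), to run daily at 02:00 local time in a UTC+8 zone, the coordinator start time is set to 18:00Z of the previous day:

```xml
<coordinator-app name="daily_rfm" frequency="${coord:days(1)}"
                 start="2019-12-31T18:00Z" end="2020-12-31T18:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>
```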
3) The coordinator automatically detects the parameters of the workflow it calls and can assign values to them.
In SQL, reference a parameter as ${param_name}; for a string value, quote it: '${param_name}'.
In the workflow definition, parameters are written as ${param_name} with no quotation marks.
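A sketch of the hand-off (the property name run_date is hypothetical): the coordinator computes the date and passes it to the workflow as a property, which Hive scripts can then reference as ${run_date}:

```xml
<!-- In coordinator.xml: compute the date and hand it to the workflow -->
<workflow>
  <app-path>${workflowAppUri}</app-path>
  <configuration>
    <property>
      <name>run_date</name>
      <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(),-2,'DAY'), 'yyyy-MM-dd')}</value>
    </property>
  </configuration>
</workflow>
```

In the Hive script the string form would then be quoted, e.g. WHERE dt = '${run_date}'.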
4) Workflow parameters can also be supplied by uploading a file. (What file format does this use?)
5) The job dies midway with status FAILED:
Halting due to Out Of Memory Error...
GC overhead limit exceeded
Workaround: increase the heap size of the Oozie launcher.
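One way to do this (the values below are examples, not tuned recommendations) is to raise the launcher container memory and heap in the action's configuration block:

```xml
<configuration>
  <property>
    <name>oozie.launcher.mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>oozie.launcher.mapreduce.map.java.opts</name>
    <value>-Xmx3276m</value>
  </property>
</configuration>
```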
6) When sub-workflows run concurrently, a sub-workflow may show status Succeeded even though its task did not actually complete, because an adjacent sub-workflow failed with an error such as:
GC overhead limit exceeded Closing: 0: jdbc:hive2://spark-02:10000/default Intercepting System.exit(2) Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]
8. Oozie can invoke Hive QL, Spark, Java, and Shell actions.
9. A Bundle is configured in the same area as workflows and coordinators.
It can package multiple coordinators together.
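A minimal bundle sketch (names and paths are hypothetical) packaging two coordinators:

```xml
<bundle-app name="rfm_bundle" xmlns="uri:oozie:bundle:0.2">
  <coordinator name="daily_coord">
    <app-path>${nameNode}/apps/coord/daily</app-path>
  </coordinator>
  <coordinator name="weekly_coord">
    <app-path>${nameNode}/apps/coord/weekly</app-path>
  </coordinator>
</bundle-app>
```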