Optimizing Hive multi-granularity aggregation with the column-to-row function lateral view

Today I came across a Hive optimization idea for reducing the number of jobs that is worth learning from.

 The test table and data are as follows: 

create table sale_order as 
select '河南省' as province,'洛阳市' as city,'汝阳县' as country, 100 as sales
union all 
select '河南省' as province,'洛阳市' as city,'伊川县' as country, 200 as sales
union all 
select '河南省' as province,'洛阳市' as city,'汝阳县' as country, 300 as sales
union all 
select '河南省' as province,'许昌市' as city,'许昌县' as country, 500 as sales 

The sales amount needs to be aggregated at three granularities: province, city, and county.

The first, conventional idea: aggregate each granularity separately and then combine the results with union all.

-- Basic approach: 4 jobs
select '1' as id,province,sum(sales) as sales from sale_order group by province
union all 
select '2' as id,city,sum(sales) as sales from sale_order group by city
union all 
select '3' as id,country,sum(sales) as sales from sale_order group by country  

The log looks like this:

hive> select '1' as id,province,sum(sales) as sales from sale_order group by province
    > union all 
    > select '2' as id,city,sum(sales) as sales from sale_order group by city
    > union all 
    > select '3' as id,country,sum(sales) as sales from sale_order group by country;
Query ID = root_20210319133916_04d285a4-0e45-46dd-9313-aaef32a8a946
Total jobs = 4
Launching Job 1 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0007, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0007/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-03-19 13:40:02,368 Stage-1 map = 0%,  reduce = 0%
2021-03-19 13:40:51,683 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.16 sec
2021-03-19 13:41:18,388 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 24.51 sec
2021-03-19 13:41:19,676 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 25.79 sec
MapReduce Total cumulative CPU time: 25 seconds 790 msec
Ended Job = job_1616110053329_0007
Launching Job 2 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0008, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0008/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0008
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2021-03-19 13:41:51,501 Stage-3 map = 0%,  reduce = 0%
2021-03-19 13:42:08,663 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 4.0 sec
2021-03-19 13:42:23,783 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 6.46 sec
MapReduce Total cumulative CPU time: 6 seconds 460 msec
Ended Job = job_1616110053329_0008
Launching Job 3 out of 4
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0009, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0009/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0009
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 1
2021-03-19 13:42:50,714 Stage-4 map = 0%,  reduce = 0%
2021-03-19 13:43:07,800 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec
2021-03-19 13:43:22,628 Stage-4 map = 100%,  reduce = 100%, Cumulative CPU 5.25 sec
MapReduce Total cumulative CPU time: 5 seconds 250 msec
Ended Job = job_1616110053329_0009
Launching Job 4 out of 4
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1616110053329_0010, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0010/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0010
Hadoop job information for Stage-2: number of mappers: 3; number of reducers: 0
2021-03-19 13:43:50,093 Stage-2 map = 0%,  reduce = 0%
2021-03-19 13:44:51,565 Stage-2 map = 0%,  reduce = 0%
2021-03-19 13:45:07,140 Stage-2 map = 67%,  reduce = 0%, Cumulative CPU 18.43 sec
2021-03-19 13:45:10,262 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 19.89 sec
MapReduce Total cumulative CPU time: 19 seconds 890 msec
Ended Job = job_1616110053329_0010
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 25.79 sec   HDFS Read: 7128 HDFS Write: 128 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 6.46 sec   HDFS Read: 7106 HDFS Write: 160 SUCCESS
Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 5.25 sec   HDFS Read: 7102 HDFS Write: 191 SUCCESS
Stage-Stage-2: Map: 3   Cumulative CPU: 19.89 sec   HDFS Read: 9650 HDFS Write: 151 SUCCESS
Total MapReduce CPU Time Spent: 57 seconds 390 msec
OK
3	伊川县	200
3	汝阳县	400
3	许昌县	500
2	洛阳市	600
2	许昌市	500
1	河南省	1100
Time taken: 356.641 seconds, Fetched: 6 row(s)

As you can see, the 4 jobs took a total of 356.641 seconds.

The second method uses the column-to-row function lateral view (together with explode and str_to_map) to turn the columns that need to be aggregated into rows of a single column, and then aggregates that column.

-- Approach 2: 1 job
select type, code, sum(sales)
from (
  select type, code, sales
  from sale_order
  lateral view explode(
    str_to_map(concat('1=', province, '&2=', city, '&3=', country), '&', '=')
  ) adtable as type, code
) df
group by type, code
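
To see why a single job is enough, it helps to look at the inner lateral view step on its own (same table and aliases as above; a sketch of the intermediate result, not taken from the logs below). str_to_map builds the map {'1': province, '2': city, '3': county} for each row, and explode turns that map into three (type, code) rows, so the first source row should expand into ('1','河南省',100), ('2','洛阳市',100), ('3','汝阳县',100). A single group by type, code then covers all three granularities at once.

-- inner step only: expand each source row into (type, code, sales) rows
select type, code, sales
from sale_order
lateral view explode(
  str_to_map(concat('1=', province, '&2=', city, '&3=', country), '&', '=')
) adtable as type, code;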

The log for the full query looks like this:

hive> select type,code,sum(sales) from (select type,code,sales from sale_order lateral view explode(str_to_map(concat('1=',province,'&2=',city,'&3=',country),'&','=')) adtable as type,code ) df group by type,code;
Query ID = root_20210319134854_767b4eea-86f5-4763-88f8-a197dcbb4fa8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1616110053329_0011, Tracking URL = http://hadoop:8088/proxy/application_1616110053329_0011/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1616110053329_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-03-19 13:49:18,665 Stage-1 map = 0%,  reduce = 0%
2021-03-19 13:49:43,007 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.93 sec
2021-03-19 13:50:04,839 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 9.49 sec
2021-03-19 13:50:07,697 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.07 sec
MapReduce Total cumulative CPU time: 11 seconds 70 msec
Ended Job = job_1616110053329_0011
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.07 sec   HDFS Read: 11114 HDFS Write: 151 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 70 msec
OK
1	河南省	1100
2	洛阳市	600
2	许昌市	500
3	伊川县	200
3	汝阳县	400
3	许昌县	500
Time taken: 75.882 seconds, Fetched: 6 row(s)

As you can see, only 1 job is launched, taking a total of 75.882 seconds, a clear improvement in efficiency.
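
As a side note (an untested variant, not from the original post): the same one-job pattern can also be written with Hive's built-in map() constructor instead of concat plus str_to_map, which avoids any trouble if a value ever contained the '&' or '=' delimiters.

-- variant of approach 2, still 1 job: build the map directly
select type, code, sum(sales) as sales
from sale_order
lateral view explode(map('1', province, '2', city, '3', country)) adtable as type, code
group by type, code;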

If you find any problems, please correct me.

 


Origin blog.csdn.net/zhaoxiangchong/article/details/115012963