A typical application of analytic functions in Hive

Hello everyone!

Today I came across an interesting question about Hive analytic functions. I worked out an answer and am sharing it here, hoping it will be useful. The requirement is: for each product, count the total number of promotion days, counting days covered by overlapping promotions only once.

When I first saw the question, I felt that an analytic function was the right tool, but I did not know how to apply it. In the end I got it written; the idea is as follows:

----The table structure and data of the test table are as follows:

hive> desc sales;
OK
id                   int                                    
produce_name         string                                  
start_time           date                                    
end_time             date                                    
days                 int                                    
Time taken: 0.354 seconds, Fetched: 5 row(s)
hive> select * from sales;
OK
1 nike 2011-09-01 2011-09-05 5
2 nike 2011-09-03 2011-09-06 4
3 nike 2011-09-09 2011-09-15 7
4 oppo 2011-08-04 2011-08-05 2
5 oppo 2011-08-04 2011-08-15 12
6 vivo 2011-08-15 2011-08-21 7
7 vivo 2011-09-02 2011-09-12 11
Time taken: 0.223 seconds, Fetched: 7 row(s)


---Step 1: For each row, find the start time and end time of the previous promotion of the same product

select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales;


---Execution result (the first step)

1 nike 2011-09-01 2011-09-05 NULL NULL
2 nike 2011-09-03 2011-09-06 2011-09-01 2011-09-05
3 nike 2011-09-09 2011-09-15 2011-09-03 2011-09-06
4 oppo 2011-08-04 2011-08-05 NULL NULL
5 oppo 2011-08-04 2011-08-15 2011-08-04 2011-08-05
6 vivo 2011-08-15 2011-08-21 NULL NULL
7 vivo 2011-09-02 2011-09-12 2011-08-15 2011-08-21


---Step 2: If a row's start time falls within the previous promotion's period, replace it with the previous promotion's start time, so that overlapping promotions share one start time for the grouping that follows

select produce_name,start_time,max(end_time) end_time from 
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time;


--Execution result (the second step)

nike 2011-09-01 2011-09-06
nike 2011-09-09 2011-09-15
oppo 2011-08-04 2011-08-15
vivo 2011-08-15 2011-08-21
vivo 2011-09-02 2011-09-12


---Step 3: From the merged start time and the latest end time, calculate the number of promotion days in each merged period

select produce_name,start_time,end_time,datediff(end_time,start_time)+1 days from
(select produce_name,start_time,max(end_time) end_time from 
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time) e;


--Execution result (the third step)

nike 2011-09-01 2011-09-06 6
nike 2011-09-09 2011-09-15 7
oppo 2011-08-04 2011-08-15 12
vivo 2011-08-15 2011-08-21 7
vivo 2011-09-02 2011-09-12 11

--Step 4: Group by product name and sum the days to get the final promotion total

select produce_name,sum(days) from
(select produce_name,start_time,end_time,datediff(end_time,start_time)+1 days from
(select produce_name,start_time,max(end_time) end_time from 
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time) e) f
group by produce_name;


---Execution result (the fourth step)

nike 13
oppo 12
vivo 18

 

Note: this is just my personal approach; I hope it is helpful to everyone!

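On later analysis, this method turned out to be flawed. lag only looks back one row, so when a run of overlapping promotions spans more than two rows, a row's start time is replaced by the previous row's original start time rather than by the earliest start time of the whole run. The run is then split into several groups, and the overlapping days are counted more than once.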
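To make the flaw concrete, here is a small sketch that runs the step 2 logic on three chained promotions. The `demo` data and the product name are hypothetical and not part of the original test table; the sketch also assumes a Hive version that supports CTEs and SELECT without FROM (0.13 or later).

```sql
-- Hypothetical data: rows 1-2 overlap and rows 2-3 overlap,
-- so the true promotion period is 2011-09-01..2011-09-10 (10 days).
with demo as (
  select 1 as id, 'demo' as produce_name,
         cast('2011-09-01' as date) as start_time, cast('2011-09-05' as date) as end_time
  union all
  select 2 as id, 'demo' as produce_name,
         cast('2011-09-04' as date) as start_time, cast('2011-09-08' as date) as end_time
  union all
  select 3 as id, 'demo' as produce_name,
         cast('2011-09-07' as date) as start_time, cast('2011-09-10' as date) as end_time
)
-- Same logic as step 2 of the first method, applied to demo instead of sales.
select produce_name, start_time, max(end_time) as end_time
from (
  select produce_name,
         case when start_time >= before_start_time and start_time <= before_end_time
              then before_start_time else start_time end as start_time,
         end_time
  from (
    select id, produce_name, start_time, end_time,
           lag(start_time) over (partition by produce_name order by id) as before_start_time,
           lag(end_time)   over (partition by produce_name order by id) as before_end_time
    from demo
  ) t
) d
group by produce_name, start_time;
-- Row 3 is rewritten to start on 2011-09-04 (row 2's original start), not 2011-09-01,
-- so two groups come out: 09-01..09-08 and 09-04..09-10.
-- Summing their lengths gives 8 + 7 = 15 days instead of the true 10.
```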

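A revised approach works from gaps instead of merged groups. Its first step is to get the previous row's end time for each row; for the first row of each product, lag falls back to the row's own start time.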

select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales;

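Worked out by hand from the sample data, the result is:

1 nike 2011-09-01 2011-09-05 2011-09-01
2 nike 2011-09-03 2011-09-06 2011-09-05
3 nike 2011-09-09 2011-09-15 2011-09-06
4 oppo 2011-08-04 2011-08-05 2011-08-04
5 oppo 2011-08-04 2011-08-15 2011-08-05
6 vivo 2011-08-15 2011-08-21 2011-08-15
7 vivo 2011-09-02 2011-09-12 2011-08-21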
 
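The three-argument form lag(column, offset, default) is what keeps the first row of each product from producing a NULL. A minimal sketch comparing the two forms on the same sales table follows; the aliases prev_end_null and prev_end_default are illustrative names, not from the original queries.

```sql
-- prev_end_null and prev_end_default are illustrative aliases.
select id, produce_name, start_time, end_time,
       -- One-argument form: NULL when there is no previous row in the partition.
       lag(end_time) over (partition by produce_name order by id) as prev_end_null,
       -- Three-argument form: falls back to the row's own start_time instead.
       lag(end_time, 1, start_time) over (partition by produce_name order by id) as prev_end_default
from sales;
```

With start_time as the default, the later test start_time<=before_end_time is true for the first row of each product, so that row contributes a gap of 0 without any extra NULL handling.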

The second step is to calculate, for each row, the length of its own period and the gap between its start time and the previous row's end time. Because datediff returns the difference between two dates, the length of a continuous period needs 1 added (both end dates count), while the length of a gap needs 1 subtracted (neither end date counts).

select 
id,
produce_name,
end_time,
start_time,
before_end_time,
datediff(end_time,start_time)+1 as cnt_all,
case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end cnt 
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t;

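Worked out by hand from the sample data, the result is:

1 nike 2011-09-05 2011-09-01 2011-09-01 5 0
2 nike 2011-09-06 2011-09-03 2011-09-05 4 0
3 nike 2011-09-15 2011-09-09 2011-09-06 7 2
4 oppo 2011-08-05 2011-08-04 2011-08-04 2 0
5 oppo 2011-08-15 2011-08-04 2011-08-05 12 0
6 vivo 2011-08-21 2011-08-15 2011-08-15 7 0
7 vivo 2011-09-12 2011-09-02 2011-08-21 11 11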
 

The result matches expectations. The third nike row starts on the 9th and the previous promotion ended on the 6th, so there is a gap of 2 days (the 7th and the 8th) between them.
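The two adjustments can be checked in isolation with literal dates taken from the nike rows. This is only a sketch; it assumes a Hive version that allows SELECT without FROM, and the aliases promo_days and gap_days are illustrative.

```sql
-- datediff(enddate, startdate) returns end minus start, in days.
-- promo_days: 4 + 1 = 5, because both the 1st and the 5th are promotion days.
-- gap_days:   3 - 1 = 2, because only the 7th and the 8th lie between the 6th and the 9th.
select
  datediff('2011-09-05', '2011-09-01') + 1 as promo_days,
  datediff('2011-09-09', '2011-09-06') - 1 as gap_days;
```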

The third step aggregates by product: it computes the total span from the earliest start time to the latest end time, and the sum of the gap days inside that span

select 
produce_name,
min(start_time) as start_time,
max(end_time) as end_time,
datediff(max(end_time),min(start_time))+1 cnt_all,
sum(case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end) cnt 
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t 
group by produce_name;

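Worked out by hand from the sample data, the result is:

nike 2011-09-01 2011-09-15 15 2
oppo 2011-08-04 2011-08-15 12 0
vivo 2011-08-15 2011-09-12 29 11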

The fourth step is the final, simplified script: the total span minus the gap days

select 
produce_name,
datediff(max(end_time),min(start_time))+1-
    sum(case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end) cnt 
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t 
group by produce_name;

The final result, worked out by hand from the sample data, matches the totals of the first method:

nike 13
oppo 12
vivo 18
 
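One case the lag-based script may still miss is a short promotion nested inside a longer one. There the previous row's end time is earlier than an end time seen before it, so a gap can be counted where the product was in fact still on promotion. Ordering by id also assumes that ids follow start times. The variant below is a sketch and has not been run on a real cluster. It replaces lag with a running maximum of end_time over all earlier rows and orders by start_time; the alias prev_max_end is an illustrative name. Traced by hand on the sample rows, it gives the same 13, 12 and 18.

```sql
-- Sketch only: same sales table; prev_max_end is an illustrative alias.
select
  produce_name,
  datediff(max(end_time), min(start_time)) + 1
    - sum(case when prev_max_end is null or start_time <= prev_max_end then 0
               else datediff(start_time, prev_max_end) - 1 end) as cnt
from (
  select
    produce_name,
    start_time,
    end_time,
    -- Latest end_time among all earlier rows of the product;
    -- NULL for the first row, because its window frame is empty.
    max(end_time) over (
      partition by produce_name
      order by start_time, end_time
      rows between unbounded preceding and 1 preceding
    ) as prev_max_end
  from sales
) t
group by produce_name;
```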

This is just my personal view; corrections are welcome.

 

Origin blog.csdn.net/zhaoxiangchong/article/details/78523589