HiveSQL, one small trick a day: how to design a 1-180 day registration-to-active retention table

0 Requirements

There are two existing tables: a user activity table user_active(user_id, active_date) and a user registration table user_regist(user_id, regist_date).

Both tables are partitioned by dt (yyyy-MM-dd), and the user key in both is user_id.

Design a retention table that shows, for each registration date, how many of the registered users are active 1 to 180 days after registering.

1 Analysis

The requirement is a registration-to-active retention table covering retention periods of 1 to 180 days. The target output looks like this:

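| registration date | retention period (days) | active users | registered users | retention rate |
|-------------------|-------------------------|--------------|------------------|----------------|
| 2023-01-10        | 1                       | 100          | 200              | active users / registered users |
| 2023-01-10        | 2                       | 50           | 200              |                |
| 2023-01-10        | 3                       | 10           | 200              |                |
| ...               | ...                     | ...          | ...              | ...            |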
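As a quick check of the formula, the retention rate for each sample row is the active count divided by the fixed registration count. The query below is only an arithmetic sketch using the literal numbers from the sample rows; the column aliases are illustrative and not part of the original design.

```sql
-- Arithmetic sketch only: literal values taken from the sample rows above.
-- In Hive, the / operator returns a double, so no explicit cast is needed.
select 100 / 200 as day1_retention_rate   -- 0.5
      ,50  / 200 as day2_retention_rate   -- 0.25
      ,10  / 200 as day3_retention_rate;  -- 0.05
```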

Key point being tested: the Cartesian product produced by a one-to-many join.

Looking at the target table, the denominator (registered users) is fixed for each registration date, while the numerator (active users) changes with the retention period.

Step 1: Compute the number of registrations per day from the registration table. This count is the denominator and is the same for every row of a given registration date, so we compute it with a window function, which keeps one row per user while attaching the daily total.

select user_id
      ,to_date(regist_date) as regist_date
      ,count(user_id) over(partition by to_date(regist_date)) as regist_count
from user_regist
where dt >= date_sub(current_date(), 180)
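To see what the window function returns, here is a minimal sketch run against three made-up registration rows. The CTE name user_regist_sample and its values are assumptions for illustration only; they are not from the original tables.

```sql
-- Hypothetical sample data standing in for user_regist.
with user_regist_sample as (
    select 'A' as user_id, '2023-01-10 08:00:00' as regist_date
    union all
    select 'B' as user_id, '2023-01-10 21:30:00' as regist_date
    union all
    select 'C' as user_id, '2023-01-11 09:15:00' as regist_date
)
select user_id
      ,to_date(regist_date) as regist_date
      -- daily total attached to every user row; rows are not collapsed
      ,count(user_id) over(partition by to_date(regist_date)) as regist_count
from user_regist_sample;
```

Users A and B both get regist_count = 2 for 2023-01-10, and user C gets regist_count = 1 for 2023-01-11 (row order is not guaranteed). Every user row survives, which is what lets the next step join on user_id while still carrying the denominator.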

Step 2: Use the registration table as the main table and left join it to the activity table on user_id. Because one registered user matches many active days, the join fans out into one row per (user, active day) pair; this is the Cartesian product effect of a one-to-many join.

Note: a user can be active several times in one day, so the activity table must be deduplicated to one row per user per day before joining.

select t1.regist_date
      ,t1.user_id
      ,t1.regist_count
      ,t2.user_id
      ,t2.active_date
      ,datediff(t2.active_date, t1.regist_date) as date_diff
from (
       -- one row per registered user, with the daily registration total attached
       select user_id
             ,to_date(regist_date) as regist_date
             ,count(user_id) over(partition by to_date(regist_date)) as regist_count
       from user_regist
       where dt >= date_sub(current_date(), 180)
   ) t1
   left join (
       -- deduplicate to one row per user per active day
       select user_id
             ,to_date(active_date) as active_date
       from user_active
       where dt >= date_sub(current_date(), 180)
       group by user_id, to_date(active_date)
   ) t2
on t1.user_id = t2.user_id

The result looks like this:

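| regist_date | t1.user_id | t1.regist_count | t2.user_id | t2.active_date | date_diff |
|-------------|------------|-----------------|------------|----------------|-----------|
| 2023-01-10  | A          | 200             | A          | 2023-01-11     | 1         |
| 2023-01-10  | A          | 200             | A          | 2023-01-12     | 2         |
| 2023-01-10  | A          | 200             | A          | 2023-01-13     | 3         |
| 2023-01-10  | A          | 200             | A          | 2023-01-14     | 4         |
| 2023-01-10  | B          | 200             | B          | 2023-01-13     | 3         |
| 2023-01-10  | B          | 200             | B          | 2023-01-14     | 4         |
| 2023-01-10  | B          | 200             | B          | 2023-01-15     | 5         |
| 2023-01-10  | B          | 200             | B          | 2023-01-16     | 6         |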

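Step 3: Group by registration date and retention period (date_diff), keep only retention periods from 1 to 180 days, and count the active users in each group. Because the WHERE clause filters on t2.active_date, registered users with no matching activity are dropped, so the left join effectively behaves like an inner join here.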

select t1.regist_date
      ,max(t1.regist_count) as regist_cnt -- fixed for each day, so max() simply picks that value out
      ,datediff(t2.active_date, t1.regist_date) as date_diff
      ,count(t1.user_id) as active_user_cnt
from (
       select user_id
             ,to_date(regist_date) as regist_date
             ,count(user_id) over(partition by to_date(regist_date)) as regist_count
       from user_regist
       where dt >= date_sub(current_date(), 180)
   ) t1
   left join (
       select user_id
             ,to_date(active_date) as active_date
       from user_active
       where dt >= date_sub(current_date(), 180)
       group by user_id, to_date(active_date)
   ) t2 on t1.user_id = t2.user_id
where datediff(t2.active_date, t1.regist_date) >= 1
  and datediff(t2.active_date, t1.regist_date) <= 180
group by t1.regist_date, datediff(t2.active_date, t1.regist_date)

Step 4: Calculate the retention rate

select regist_date
     , date_diff
     , active_user_cnt
     , regist_cnt
       -- guard against a null or zero denominator; the rate is null in that case
     , case when nvl(regist_cnt,0) != 0
            then active_user_cnt / regist_cnt end as retention_rate
from
(select t1.regist_date
       ,max(t1.regist_count) as regist_cnt -- fixed for each day, so max() simply picks that value out
       ,datediff(t2.active_date, t1.regist_date) as date_diff
       ,count(t1.user_id) as active_user_cnt
   from (
       select user_id
             ,to_date(regist_date) as regist_date
             ,count(user_id) over(partition by to_date(regist_date)) as regist_count
       from user_regist
       where dt >= date_sub(current_date(), 180)
   ) t1
   left join (
       select user_id
             ,to_date(active_date) as active_date
       from user_active
       where dt >= date_sub(current_date(), 180)
       group by user_id, to_date(active_date)
   ) t2 on t1.user_id = t2.user_id
   where datediff(t2.active_date, t1.regist_date) >= 1
     and datediff(t2.active_date, t1.regist_date) <= 180
   group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
) t
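Reports often want a few fixed retention days as columns rather than one row per retention period. The following is a sketch of that variation, not the method used in this article: it reuses the same two deduplicated inputs as CTEs and pivots with conditional aggregation. The CTE names (regist, active), the output column names, and the choice of days 1, 7, and 30 are all assumptions made for illustration.

```sql
-- Sketch of a wide-format variant; names and chosen days are illustrative.
with regist as (
    select user_id
          ,to_date(regist_date) as regist_date
    from user_regist
    where dt >= date_sub(current_date(), 180)
),
active as (
    -- one row per user per active day
    select user_id
          ,to_date(active_date) as active_date
    from user_active
    where dt >= date_sub(current_date(), 180)
    group by user_id, to_date(active_date)
)
select r.regist_date
      -- count(distinct ...) is unaffected by the one-to-many fan-out of the join
      ,count(distinct r.user_id) as regist_cnt
      ,count(distinct case when datediff(a.active_date, r.regist_date) = 1
                           then r.user_id end) as day1_active_cnt
      ,count(distinct case when datediff(a.active_date, r.regist_date) = 7
                           then r.user_id end) as day7_active_cnt
      ,count(distinct case when datediff(a.active_date, r.regist_date) = 30
                           then r.user_id end) as day30_active_cnt
from regist r
left join active a on r.user_id = a.user_id
group by r.regist_date;
```

Since there is no WHERE filter on the activity columns, the left join keeps registration dates whose users never came back, and those dates show 0 in the active-count columns. Dividing each dayN_active_cnt by regist_cnt in an outer query gives the corresponding retention rate.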

2 Summary

This article presents a calculation model for a 1-180 day registration-to-active retention table. The core of the solution is the Cartesian product produced by a one-to-many join between registered users and their active days. The same pattern comes up often in data reporting and is worth mastering.


Origin blog.csdn.net/godlovedaniel/article/details/128884709