Article directory
- 1. Introduction
  - Case: count users who have logged in for 3 or more consecutive days
    - Data preparation
    - Solution 1: Conventional thinking
      - Group the data by user_id and sort by activity date
      - Subtract rn days from the login date with DATE_SUB; equal results mean the days are consecutive
      - Group by user_id and sub_date, then count the logins with count(1)
    - Solution 2: Use the LAG and LEAD functions
      - Use LAG and LEAD to find the previous and next login dates
      - Keep rows whose previous and next login dates are each 1 day from the current date
      - Group by user and use datediff on the latest and earliest dates to find users with >= 3 days
    - Comparing Solution 1 and Solution 2
As a big data developer, you should never let your SQL skills slip.
1. Introduction
In ETL work and in big data interviews, we often have to write SQL by hand, and the consecutive-login problem is one of the most common exercises. It usually takes the form "count the XX who have logged in to XX for N consecutive days".
This article introduces two solutions that make this kind of SQL problem easy to handle.
MySQL 8.x and Hive both support most of the functions used here. For efficiency and convenience, the examples use MySQL; other SQL dialects are similar. If you have any questions, you can leave a message in the comment area.
Case: count users who have logged in for 3 or more consecutive days
Data preparation
Execute the following code in MySQL to create and populate the data table:
-- ----------------------------
-- Table structure for user_activity
-- ----------------------------
DROP TABLE IF EXISTS `user_activity`;
CREATE TABLE `user_activity` (
  `user_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL,
  `activity_date` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;

-- ----------------------------
-- Records of user_activity
-- ----------------------------
INSERT INTO `user_activity` VALUES ('user1', '2023-03-01');
INSERT INTO `user_activity` VALUES ('user2', '2023-03-02');
INSERT INTO `user_activity` VALUES ('user3', '2023-03-03');
INSERT INTO `user_activity` VALUES ('user4', '2023-03-04');
INSERT INTO `user_activity` VALUES ('user1', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user2', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user3', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user3', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-11');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-11');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-11');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-12');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-12');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-13');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-13');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-14');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-14');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-15');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-15');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-16');
SELECT * FROM `user_activity`
The result is as follows:
user_id  activity_date
user1    2023-03-01
user2    2023-03-02
user3    2023-03-03
user4    2023-03-04
user1    2023-03-08
user2    2023-03-08
user5    2023-03-08
user6    2023-03-08
user3    2023-03-09
user5    2023-03-09
user6    2023-03-09
user7    2023-03-09
user3    2023-03-10
user5    2023-03-10
user6    2023-03-10
user7    2023-03-10
user5    2023-03-11
user6    2023-03-11
user7    2023-03-11
user6    2023-03-12
user7    2023-03-12
user7    2023-03-13
user8    2023-03-13
user7    2023-03-14
user8    2023-03-14
user7    2023-03-15
user8    2023-03-15
user8    2023-03-16
Solution 1: Conventional thinking
- 1. First group the data by user_id and sort each user's rows by activity date, numbering them with ROW_NUMBER() as rn.
- 2. Subtract rn days from the login date with DATE_SUB. If two rows of the same user yield the same result (sub_date), their dates belong to the same consecutive run.
- For example, take January 1, 2, and 3, 2023, which are ranked 1, 2, and 3. Subtracting each rank from its date gives December 31, 2022 every time.
- 3. Group by user_id and sub_date; count(1) within each group is the number of consecutive login days.
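The date-minus-rank idea in the steps above can be checked outside the database. The following is a minimal Python sketch, not part of the original SQL: the function name date_minus_rank and the sample dates (including the extra January 6 login that breaks the run) are assumptions made for illustration, and it assumes at most one login row per day.

```python
from datetime import date, timedelta


def date_minus_rank(login_dates):
    """Return (login_date, rank, login_date - rank days) for each login date.

    Mirrors ROW_NUMBER() ordered by date followed by DATE_SUB(date, INTERVAL rn DAY).
    Assumes at most one login per day (hypothetical helper, for illustration only).
    """
    rows = []
    for rank, day in enumerate(sorted(login_dates), start=1):
        rows.append((day, rank, day - timedelta(days=rank)))
    return rows


# Hypothetical logins: a 3-day run, then a gap, then a single day.
logins = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3), date(2023, 1, 6)]
rows = date_minus_rank(logins)
for day, rank, sub_date in rows:
    print(day, rank, sub_date)
```

The first three rows share the sub_date 2022-12-31, while the January 6 login maps to 2023-01-02, so it falls into a group of its own.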
Group the data by user_id and sort by activity date
-- Number each user's logins in date order
select
    user_id,
    activity_date,
    ROW_NUMBER() over(partition by user_id order by activity_date) as rn
from user_activity;
Subtract rn days from the login date with DATE_SUB; rows with equal results belong to the same run of consecutive days
-- Rows in the same consecutive run share the same sub_date
SELECT
    user_id,
    activity_date,
    DATE_SUB(activity_date, INTERVAL rn DAY) as sub_date
from (
    select
        user_id,
        activity_date,
        ROW_NUMBER() over(partition by user_id order by activity_date) as rn
    from user_activity
) t1;
Group by user_id and sub_date; count(1) gives the number of consecutive login days in each group
-- One output row per run of 3 or more consecutive days
SELECT
    user_id,
    min(activity_date) as min_date,
    max(activity_date) as max_date,
    count(1) as login_times
from (
    SELECT
        user_id,
        activity_date,
        DATE_SUB(activity_date, INTERVAL rn DAY) as sub_date
    from (
        select
            user_id,
            activity_date,
            ROW_NUMBER() over(partition by user_id order by activity_date) as rn
        from user_activity
    ) t1
) t2
group by user_id, sub_date
having login_times >= 3;
- The results show that user5, user6, user7, and user8 have each logged in for 3 or more consecutive days.
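As a cross-check of that result, the same grouping can be replayed in plain Python over the sample data. This is only a sketch under stated assumptions: LOGIN_DAYS restates the March 2023 rows of user_activity as day-of-month numbers, and users_with_streak is a hypothetical helper name, not something defined in the SQL above.

```python
from collections import Counter
from datetime import date, timedelta

# Day-of-month of each login in March 2023, restating the user_activity rows.
LOGIN_DAYS = {
    "user1": [1, 8],
    "user2": [2, 8],
    "user3": [3, 9, 10],
    "user4": [4],
    "user5": [8, 9, 10, 11],
    "user6": [8, 9, 10, 11, 12],
    "user7": [9, 10, 11, 12, 13, 14, 15],
    "user8": [13, 14, 15, 16],
}


def users_with_streak(login_days, min_len=3):
    """Map each qualifying user to the length of their longest consecutive run.

    Groups every login by (date - rank days), like GROUP BY user_id, sub_date.
    Assumes at most one login per user per day.
    """
    result = {}
    for user, days in login_days.items():
        dates = sorted(date(2023, 3, d) for d in days)
        groups = Counter(
            day - timedelta(days=rn) for rn, day in enumerate(dates, start=1)
        )
        longest = max(groups.values())
        if longest >= min_len:
            result[user] = longest
    return result


print(users_with_streak(LOGIN_DAYS))
# {'user5': 4, 'user6': 5, 'user7': 7, 'user8': 4}
```

user3 logs in on three days in total, but only two of them are adjacent, so it is correctly excluded.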
Solution 2: Use the LAG and LEAD functions
- 1. For each user_id, use the LAG and LEAD functions to find the login dates immediately before and after the current one.
- 2. For each user, if the previous login date and the next login date each differ from the current date by exactly 1 day, the current date is part of a consecutive login run.
- For example, take January 1, 2, and 3, 2023. For January 2, the differences to the dates before and after it are 2 - 1 = 1 and 3 - 2 = 1, so both checks equal 1.
- 3. Group by user and use the datediff function on the latest and earliest activity dates to get the number of days, then keep the users with >= 3 days.
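The neighbour check in step 2 can be illustrated with a short Python sketch. It is not the article's SQL: middle_days is a hypothetical helper, the sample dates are made up, and the fallback to the current date at the edges is an assumption that imitates passing the current date as the default value of LAG and LEAD.

```python
from datetime import date, timedelta


def middle_days(login_dates):
    """Return the dates whose previous and next login are exactly one day away."""
    days = sorted(login_dates)
    one_day = timedelta(days=1)
    hits = []
    for i, current in enumerate(days):
        # At the edges there is no neighbour, so fall back to the current
        # date; the difference is then 0 and the row is filtered out.
        prev_day = days[i - 1] if i > 0 else current
        next_day = days[i + 1] if i < len(days) - 1 else current
        if current - prev_day == one_day and next_day - current == one_day:
            hits.append(current)
    return hits


# Hypothetical logins: a 3-day run, then a gap, then a single day.
logins = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3), date(2023, 1, 6)]
print(middle_days(logins))  # [datetime.date(2023, 1, 2)]
```

Only the middle day of a three-day run passes both checks, so any date that survives proves that the user logged in on at least 3 consecutive days.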
Use the LAG and LEAD functions to find the previous and next login dates
-- The third argument is the default: a user's first and last rows fall back to the current date
select
    user_id,
    LAG(activity_date, 1, activity_date) over(partition by user_id order by activity_date) as lag_login_date,
    activity_date as current_login_date,
    LEAD(activity_date, 1, activity_date) over(partition by user_id order by activity_date) as lead_login_date
from user_activity;
For each user, keep the rows whose previous and next login dates are each exactly 1 day from the current date; these rows belong to a consecutive login run
-- Each surviving row is the middle day of three consecutive login days
SELECT
    user_id,
    lag_login_date,
    current_login_date,
    lead_login_date
from (
    select
        user_id,
        LAG(activity_date, 1, activity_date) over(partition by user_id order by activity_date) as lag_login_date,
        activity_date as current_login_date,
        LEAD(activity_date, 1, activity_date) over(partition by user_id order by activity_date) as lead_login_date
    from user_activity
) t1
where datediff(current_login_date, lag_login_date) = 1
  and datediff(lead_login_date, current_login_date) = 1;
Group by user and use datediff on the latest and earliest dates to find the users with >= 3 days
-- Assumption: each user has a single consecutive run, so the min/max span equals the run length
SELECT
    user_id,
    min(lag_login_date) as min_date,
    max(lead_login_date) as max_date,
    datediff(max(lead_login_date), min(lag_login_date)) + 1 as login_days
from (
    select
        user_id,
        LAG(activity_date, 1, activity_date) over(partition by user_id order by activity_date) as lag_login_date,
        activity_date as current_login_date,
        LEAD(activity_date, 1, activity_date) over(partition by user_id order by activity_date) as lead_login_date
    from user_activity
) t1
where datediff(current_login_date, lag_login_date) = 1
  and datediff(lead_login_date, current_login_date) = 1
group by user_id
having login_days >= 3;
Comparing Solution 1 and Solution 2
Solution 1 is simple in concept and easy to implement. It only requires an understanding of ranking window functions and basic SQL skills. Difficulty: medium.
Solution 2 is also simple in concept but harder to implement, and it requires a solid grasp of window functions. Difficulty: high.