Handling SQL: Using "counting users who have logged in for 3 or more consecutive days" as an example for solving this class of SQL problems


  • As a big data developer, never let your SQL skills slip.

1. Introduction

In ETL work and in big data interviews, you are often asked to write SQL by hand, and the consecutive-login problem is one of the most common questions. It usually takes the form "count the XX that have logged in to XX for N consecutive days".
This post introduces two solutions that make this kind of SQL problem easy to handle.

MySQL 8.x and Hive share most of the functions used here. For efficiency and convenience, MySQL is used as the example; other SQL dialects are similar. If you have any questions, leave a message in the comment area.

Case: the requirement is "count the users who have logged in for 3 or more consecutive days".

Data preparation

Execute the following code in MySQL to create and populate the table:

-- ----------------------------
-- Table structure for user_activity
-- ----------------------------
DROP TABLE IF EXISTS `user_activity`;
CREATE TABLE `user_activity`  (
  `user_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL,
  `activity_date` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NULL DEFAULT NULL
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;

-- ----------------------------
-- Records of user_activity
-- ----------------------------
INSERT INTO `user_activity` VALUES ('user1', '2023-03-01');
INSERT INTO `user_activity` VALUES ('user2', '2023-03-02');
INSERT INTO `user_activity` VALUES ('user3', '2023-03-03');
INSERT INTO `user_activity` VALUES ('user4', '2023-03-04');
INSERT INTO `user_activity` VALUES ('user1', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user2', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-08');
INSERT INTO `user_activity` VALUES ('user3', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-09');
INSERT INTO `user_activity` VALUES ('user3', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-10');
INSERT INTO `user_activity` VALUES ('user5', '2023-03-11');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-11');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-11');
INSERT INTO `user_activity` VALUES ('user6', '2023-03-12');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-12');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-13');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-13');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-14');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-14');
INSERT INTO `user_activity` VALUES ('user7', '2023-03-15');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-15');
INSERT INTO `user_activity` VALUES ('user8', '2023-03-16');
SELECT * FROM `user_activity`

The result is as follows:

user1	2023-03-01
user2	2023-03-02
user3	2023-03-03
user4	2023-03-04
user1	2023-03-08
user2	2023-03-08
user5	2023-03-08
user6	2023-03-08
user3	2023-03-09
user5	2023-03-09
user6	2023-03-09
user7	2023-03-09
user3	2023-03-10
user5	2023-03-10
user6	2023-03-10
user7	2023-03-10
user5	2023-03-11
user6	2023-03-11
user7	2023-03-11
user6	2023-03-12
user7	2023-03-12
user7	2023-03-13
user8	2023-03-13
user7	2023-03-14
user8	2023-03-14
user7	2023-03-15
user8	2023-03-15
user8	2023-03-16

Solution 1: Conventional thinking

  • 1. Partition the data by user_id and rank each user's rows by activity date (rn)
  • 2. Subtract rn days from the login date with date_sub; rows that produce the same resulting date (sub_date) belong to one run of consecutive days
    • For example, for January 1, January 2, and January 3, 2023, the ranks are 1, 2, and 3; subtracting each rank from its date gives December 31, 2022 in all three cases.
  • 3. Group by user_id and sub_date; count(1) in each group is the number of consecutive login days
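
The date-minus-rank idea in step 2 can be checked on a handful of literal dates before touching the real table. The sketch below is only an illustration: the CTE names `sample` and `ranked`, the column `login_date`, and the extra isolated date 2023-01-05 are made up for the example, and it assumes MySQL 8.x (CTEs and window functions).

```sql
-- Hypothetical sample: three consecutive days plus one isolated day.
WITH sample AS (
    SELECT '2023-01-01' AS login_date
    UNION ALL SELECT '2023-01-02'
    UNION ALL SELECT '2023-01-03'
    UNION ALL SELECT '2023-01-05'
),
ranked AS (
    -- Rank the dates in ascending order.
    SELECT
        login_date,
        ROW_NUMBER() OVER (ORDER BY login_date) AS rn
    FROM sample
)
SELECT
    login_date,
    rn,
    -- Consecutive dates collapse onto the same sub_date.
    DATE_SUB(login_date, INTERVAL rn DAY) AS sub_date
FROM ranked
ORDER BY login_date;
-- login_date   rn  sub_date
-- 2023-01-01   1   2022-12-31
-- 2023-01-02   2   2022-12-31
-- 2023-01-03   3   2022-12-31
-- 2023-01-05   4   2023-01-01
```

The first three rows share one sub_date and form a run of 3 days; the isolated date gets a different sub_date and falls into its own group.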

Partition the data by user_id and rank each user's rows by activity date:

select
	user_id,
	activity_date,
	ROW_NUMBER() over(partition by user_id order by activity_date) as rn
from user_activity;

Subtract rn days from the login date with date_sub; rows that produce the same sub_date belong to one run of consecutive days:

SELECT
	user_id,
	activity_date,
	DATE_SUB(activity_date,INTERVAL rn DAY) as sub_date
from(
	select
		user_id,
		activity_date,
		ROW_NUMBER() over(partition by user_id order by activity_date) as rn
	from user_activity
)t1;

Group by user_id and sub_date; count(1) in each group is the number of consecutive login days:

SELECT
	user_id,
	min(activity_date) as min_date,
	max(activity_date)  as max_date,
	count(1) as  login_times
from(
	SELECT
		user_id,
		activity_date,
		DATE_SUB(activity_date,INTERVAL rn DAY) as sub_date
	from(
		select
			user_id,
			activity_date,
			ROW_NUMBER() over(partition by user_id order by activity_date) as rn
		from user_activity
	)t1
)t2
group by user_id,sub_date
having login_times>=3;

  • The results show that user5, user6, user7, and user8 have logged in for 3 or more consecutive days
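
One caveat, stated as an assumption because the sample data never exercises it: if `user_activity` could hold more than one row for the same user on the same day, ROW_NUMBER() would give the repeated date different ranks and split a consecutive run. A minimal sketch that removes duplicates first (the CTE names `daily` and `ranked` and the alias `login_days` are illustrative, not from the original):

```sql
WITH daily AS (
    -- Assumption: duplicates may exist; keep one row per user per day.
    SELECT DISTINCT user_id, activity_date
    FROM user_activity
),
ranked AS (
    SELECT
        user_id,
        activity_date,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date) AS rn
    FROM daily
)
SELECT
    user_id,
    MIN(activity_date) AS min_date,
    MAX(activity_date) AS max_date,
    COUNT(*)           AS login_days
FROM ranked
-- Same date-minus-rank grouping as above, written inline.
GROUP BY user_id, DATE_SUB(activity_date, INTERVAL rn DAY)
HAVING COUNT(*) >= 3;
```

On the sample data, which has no duplicates, this should return the same four users as the query above.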

Solution 2: Use lag and lead functions

  • 1. For each user_id, use the lag and lead functions to fetch the login date before and the login date after the current one
  • 2. For each user, if the current date is exactly 1 day after the previous login date and exactly 1 day before the next login date, the three logins are consecutive.
    • For example, for January 1, January 2, and January 3, 2023, take January 2 as the current date: the differences from its neighbors are 2-1=1 and 3-2=1, so both values equal 1.
  • 3. Group by user; use the datediff function on the latest and earliest dates of the run to get the number of consecutive days, and keep the users with >=3 days
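
The neighbor comparison in step 2 can be tried on three literal dates. This is a small sketch under assumptions: the CTE names `sample` and `neighbors` and the column names are invented for the example, and MySQL 8.x is assumed.

```sql
-- Hypothetical sample: three consecutive days.
WITH sample AS (
    SELECT '2023-01-01' AS login_date
    UNION ALL SELECT '2023-01-02'
    UNION ALL SELECT '2023-01-03'
),
neighbors AS (
    SELECT
        -- The third argument is the default used when no neighbor row exists.
        LAG(login_date, 1, login_date)  OVER (ORDER BY login_date) AS prev_date,
        login_date,
        LEAD(login_date, 1, login_date) OVER (ORDER BY login_date) AS next_date
    FROM sample
)
SELECT
    prev_date,
    login_date,
    next_date,
    DATEDIFF(login_date, prev_date) AS diff_prev,
    DATEDIFF(next_date, login_date) AS diff_next
FROM neighbors
ORDER BY login_date;
-- prev_date    login_date   next_date    diff_prev  diff_next
-- 2023-01-01   2023-01-01   2023-01-02   0          1
-- 2023-01-02   2023-01-02   2023-01-03   1          1
-- 2023-01-03   2023-01-03   2023-01-03   1          0
```

Only the middle row has both differences equal to 1, so a filter on both conditions keeps the interior days of a run and drops its first and last day.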

Use the LAG and LEAD functions to fetch the previous and next login dates:

select
	user_id,
	LAG(activity_date,1,activity_date) over(partition by user_id order by activity_date) as lag_login_date,
	activity_date as current_login_date,
	LEAD(activity_date,1,activity_date) over(partition by user_id order by activity_date) as lead_login_date
from user_activity;

For each user, keep the rows where the current date is exactly 1 day after the previous login date and exactly 1 day before the next login date; these rows belong to a run of consecutive logins:

SELECT
	user_id,
	lag_login_date,
	current_login_date,
	lead_login_date
from(
	select
		user_id,
		LAG(activity_date,1,activity_date) over(partition by user_id order by activity_date) as lag_login_date,
		activity_date as current_login_date,
		LEAD(activity_date,1,activity_date) over(partition by user_id order by activity_date) as lead_login_date
	from user_activity
)t1
where datediff(current_login_date,lag_login_date)=1 
and datediff(lead_login_date,current_login_date)=1;
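
Every row that survives this filter is the middle day of three back-to-back logins, so when only the list of qualifying users is needed, a DISTINCT over the filtered rows may already be enough. A minimal sketch that reuses the same subquery (it does not compute run lengths):

```sql
SELECT DISTINCT user_id
FROM (
    SELECT
        user_id,
        LAG(activity_date, 1, activity_date)
            OVER (PARTITION BY user_id ORDER BY activity_date) AS lag_login_date,
        activity_date AS current_login_date,
        LEAD(activity_date, 1, activity_date)
            OVER (PARTITION BY user_id ORDER BY activity_date) AS lead_login_date
    FROM user_activity
) t1
-- A row passes only if it has a login the day before and the day after.
WHERE DATEDIFF(current_login_date, lag_login_date) = 1
  AND DATEDIFF(lead_login_date, current_login_date) = 1
ORDER BY user_id;
-- On the sample data: user5, user6, user7, user8
```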

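Group by user; the datediff function between the latest and earliest dates of the run gives the number of consecutive days, and the users with >=3 days are kept:

-- Every row kept by the filter is a middle day of a run, so the run
-- starts at min(lag_login_date) and ends at max(lead_login_date).
-- Assumes one consecutive run per user, as in the sample data.
SELECT
	user_id,
	min(lag_login_date) as min_date,
	max(lead_login_date) as max_date,
	datediff(max(lead_login_date),min(lag_login_date))+1 as login_days
from(
	select
		user_id,
		LAG(activity_date,1,activity_date) over(partition by user_id order by activity_date) as lag_login_date,
		activity_date as current_login_date,
		LEAD(activity_date,1,activity_date) over(partition by user_id order by activity_date) as lead_login_date
	from user_activity
)t1
where datediff(current_login_date,lag_login_date)=1
and datediff(lead_login_date,current_login_date)=1
group by user_id
having login_days>=3;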
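
For the general "N consecutive days" form of the question, one commonly used variant compares each login with the login N-1 rows earlier. This is a sketch rather than the article's method: it assumes at most one row per user per day, N is hard-coded as 3 (so the LAG offset is 2), and the alias `two_back` is made up for the example.

```sql
SELECT DISTINCT user_id
FROM (
    SELECT
        user_id,
        activity_date,
        -- Login date two rows earlier for the same user;
        -- NULL when the user has fewer than two earlier rows.
        LAG(activity_date, 2)
            OVER (PARTITION BY user_id ORDER BY activity_date) AS two_back
    FROM user_activity
) t
-- With distinct, ordered dates, a gap of exactly 2 days across 2 rows
-- means the row in between is the day in between.
WHERE DATEDIFF(activity_date, two_back) = 2
ORDER BY user_id;
-- On the sample data: user5, user6, user7, user8
```

To check for a different N under the same assumptions, replace the offset 2 and the compared value 2 with N-1.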

Comparing Solution 1 and Solution 2

Solution 1: the idea is simple and easy to implement. It only requires an understanding of window ranking functions and basic SQL skills. Difficulty: medium.
Solution 2: the idea is also simple, but it is harder to implement and requires a solid grasp of window functions. Difficulty: high.


Origin blog.csdn.net/m0_49303490/article/details/130469205