25. SQL data analysis practice (9 medium-difficulty SQL questions)

Topic 1: App usage frequency analysis

There is an App usage-time table for users, middle_app_login, and its data is shown in the following table:

mysql> SELECT * FROM middle_app_login;
-- user_id (user ID): VARCHAR, start_time (App login time): DATETIME, end_time (App logout time): DATETIME
+---------+---------------------+---------------------+
| user_id | start_time          | end_time            |
+---------+---------------------+---------------------+
| u001    | 2021-04-01 10:12:30 | 2021-04-01 11:13:21 |
| u002    | 2021-04-02 08:40:21 | 2021-04-02 10:13:41 |
| u003    | 2021-04-02 15:31:01 | 2021-04-02 15:54:42 |
| u001    | 2021-04-04 13:25:40 | 2021-04-04 17:52:46 |
| u003    | 2021-04-06 07:10:20 | 2021-04-06 08:03:15 |
| u001    | 2021-04-09 18:20:34 | 2021-04-09 18:23:58 |
| u001    | 2021-04-10 14:25:55 | 2021-04-10 15:01:25 |
+---------+---------------------+---------------------+
7 rows in set (0.00 sec)

[Question 1] Based on the table, calculate each user's average time between exiting the App and the next login. Users who have logged in only once are not counted. Output the average in minutes, rounded to one decimal place. The output columns are user_id (user ID) and avg_minute (average interval in minutes); a result sample is shown in the figure below:
(result sample image omitted)
[Analysis of Question 1] Use the LEAD() window function, partitioned by user and ordered by login time, to place each session's exit time and the next session's login time on the same row, which makes the later steps straightforward. Then filter out rows where the lead value is NULL, use TIMESTAMPDIFF() to compute the minute difference between end_time and start_time_lead, average it per user, and round to one decimal place. Knowledge points: subqueries, date/time functions, window functions, NULL handling, decimal rounding, grouping and aggregation. The reference code is as follows:

mysql> -- ① Approach from the analysis
mysql> SELECT user_id
    ->      , ROUND(AVG(TIMESTAMPDIFF(MINUTE, end_time, start_time_lead)), 1) AS avg_minute
    -> FROM (SELECT user_id
    ->            , start_time
    ->            , end_time
    ->            , LEAD(start_time, 1) OVER (PARTITION BY user_id ORDER BY start_time) AS start_time_lead
    ->       FROM middle_app_login) a
    -> WHERE start_time_lead IS NOT NULL
    -> GROUP BY user_id;
+---------+------------+
| user_id | avg_minute |
+---------+------------+
| u001    |     4293.3 |
| u003    |     5235.0 |
+---------+------------+
2 rows in set (0.00 sec)

mysql> -- ② Alternative approach
mysql> SELECT user_id, ROUND(AVG(end_time_lag), 1) AS avg_minute
    -> FROM (SELECT a1.user_id,
    ->              TIMESTAMPDIFF(MINUTE, LAG(end_time, 1) OVER (PARTITION BY a1.user_id ORDER BY start_time), a1.start_time
    ->                  ) AS end_time_lag
    ->       FROM middle_app_login a1
    ->                INNER JOIN (SELECT user_id FROM middle_app_login GROUP BY user_id HAVING COUNT(*) > 1) a2
    ->                           ON a1.user_id = a2.user_id) a
    -> WHERE a.end_time_lag IS NOT NULL
    -> GROUP BY user_id;
+---------+------------+
| user_id | avg_minute |
+---------+------------+
| u001    |     4293.3 |
| u003    |     5235.0 |
+---------+------------+
2 rows in set (0.00 sec)
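
As a quick cross-check outside MySQL (not part of the original reference answer), the sketch below replays the sample rows in plain Python. Note that TIMESTAMPDIFF(MINUTE, ...) returns whole minutes, so the second-level difference is floored before averaging:

```python
from datetime import datetime
from collections import defaultdict

# Sample rows from middle_app_login: (user_id, start_time, end_time)
rows = [
    ("u001", "2021-04-01 10:12:30", "2021-04-01 11:13:21"),
    ("u002", "2021-04-02 08:40:21", "2021-04-02 10:13:41"),
    ("u003", "2021-04-02 15:31:01", "2021-04-02 15:54:42"),
    ("u001", "2021-04-04 13:25:40", "2021-04-04 17:52:46"),
    ("u003", "2021-04-06 07:10:20", "2021-04-06 08:03:15"),
    ("u001", "2021-04-09 18:20:34", "2021-04-09 18:23:58"),
    ("u001", "2021-04-10 14:25:55", "2021-04-10 15:01:25"),
]

fmt = "%Y-%m-%d %H:%M:%S"
sessions = defaultdict(list)
for user, start, end in rows:
    sessions[user].append((datetime.strptime(start, fmt), datetime.strptime(end, fmt)))

avg_minute = {}
for user, s in sessions.items():
    s.sort()  # mirrors ORDER BY start_time inside the window function
    # Gap = minutes from one session's end to the next session's start;
    # TIMESTAMPDIFF(MINUTE, ...) yields whole minutes, hence floor division.
    gaps = [int((nxt[0] - cur[1]).total_seconds()) // 60 for cur, nxt in zip(s, s[1:])]
    if gaps:  # users with a single login are skipped
        avg_minute[user] = round(sum(gaps) / len(gaps), 1)

print(avg_minute)  # {'u001': 4293.3, 'u003': 5235.0}
```

The result matches both SQL approaches above.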

Topic 2: App download statistics

There is an App cumulative download table middle_app_download, which records each App's cumulative number of downloads. The data in the middle_app_download table is as follows:

mysql> SELECT * FROM middle_app_download;
-- app_id (App ID): VARCHAR, app_type (App type): VARCHAR, download (download count): INT
+--------+----------+----------+
| app_id | app_type | download |
+--------+----------+----------+
| a001   | A        |    12432 |
| a002   | B        |     9853 |
| a003   | A        |     1924 |
| a004   | C        |     2679 |
| a005   | C        |    29104 |
| a006   | A        |    10235 |
| a007   | B        |     5704 |
| a008   | B        |     2850 |
| a009   | B        |     8235 |
| a010   | C        |     9746 |
+--------+----------+----------+
10 rows in set (0.00 sec)

[Question 2] Query the average number of downloads per App type, excluding the Apps whose download counts rank in the top 10% and bottom 10%. The output columns are app_type (App type) and avg_download (average number of downloads); a result sample is shown in the figure below:
(result sample image omitted)
[Analysis of Question 2] Use the RANK() window function in a subquery to generate a download ranking column, then use WHERE outside the subquery to keep only the records within the required rank range, and group by App type to compute the average download count. Knowledge points: subqueries, window functions, grouping and aggregation. The reference code is as follows:

mysql> SELECT a.app_type, AVG(a.download) as avg_download
    -> FROM (SELECT app_id, app_type, download, RANK() OVER (ORDER BY download DESC ) AS download_rank
    ->       FROM middle_app_download) a
    -> WHERE a.download_rank > (SELECT COUNT(*) FROM middle_app_download) * 0.1
    ->   AND a.download_rank < (SELECT COUNT(*) FROM middle_app_download) * 0.9
    -> GROUP BY a.app_type;
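
For illustration, the same trimming logic can be replayed in plain Python on the sample rows. This cross-check mirrors the reference query's strict boundaries exactly: with 10 rows, only ranks 2 through 8 survive (`rank > 1` and `rank < 9`), and the sample has no tied download counts:

```python
# Sample rows from middle_app_download: (app_id, app_type, download)
rows = [
    ("a001", "A", 12432), ("a002", "B", 9853), ("a003", "A", 1924),
    ("a004", "C", 2679), ("a005", "C", 29104), ("a006", "A", 10235),
    ("a007", "B", 5704), ("a008", "B", 2850), ("a009", "B", 8235),
    ("a010", "C", 9746),
]

n = len(rows)
# RANK() OVER (ORDER BY download DESC): 1 = most downloaded (no ties here)
ranked = sorted(rows, key=lambda r: r[2], reverse=True)
kept = [r for i, r in enumerate(ranked, start=1) if n * 0.1 < i < n * 0.9]

# GROUP BY app_type, then AVG(download)
groups = {}
for _, app_type, download in kept:
    groups.setdefault(app_type, []).append(download)
avg_download = {t: sum(v) / len(v) for t, v in groups.items()}

print(avg_download)
```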

Topic 3: Finding Active Learners

There is a user learning check-in table middle_active_learning, and the data in the middle_active_learning table is as follows:

mysql> SELECT * FROM middle_active_learning;
-- user_id (user ID): VARCHAR, study_date (check-in date): DATE
+---------+------------+
| user_id | study_date |
+---------+------------+
| u001    | 2021-04-01 |
| u002    | 2021-04-01 |
| u003    | 2021-04-03 |
| u001    | 2021-04-06 |
| u003    | 2021-04-07 |
| u001    | 2021-04-12 |
| u001    | 2021-04-13 |
| u002    | 2021-04-14 |
| u001    | 2021-04-23 |
| u002    | 2021-04-24 |
| u001    | 2021-04-26 |
| u003    | 2021-04-27 |
| u002    | 2021-04-30 |
+---------+------------+
13 rows in set (0.00 sec)

[Question 3] Based on the table, find the users who checked in to study in every week of April 2021. The output column is user_id (user ID); a result sample is shown in the figure below:
(result sample image omitted)
[Analysis of Question 3] Use the WEEKOFYEAR() function to get each check-in's week number, restricting study_date to April 2021. Since a user may check in several times in the same week, use DISTINCT to deduplicate the (user_id, week) pairs. Then group by user and keep the users whose number of distinct check-in weeks equals 5 (April 2021 spans 5 ISO weeks, 13 through 17). Knowledge points: subqueries, DISTINCT, date/time functions. The reference code is as follows:

mysql> SELECT a.user_id
    -> FROM (SELECT DISTINCT user_id
    ->                     , WEEKOFYEAR(study_date) AS study_week
    ->       FROM middle_active_learning
    ->       WHERE study_date >= '2021-04-01'
    ->         AND study_date <= '2021-04-30') a
    -> GROUP BY a.user_id
    -> HAVING COUNT(a.study_week) = 5;
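
MySQL's WEEKOFYEAR() follows ISO-8601 week numbering, the same convention as Python's date.isocalendar(). As an illustrative cross-check on the sample rows (they all fall within April 2021, so no extra date filter is needed here, while the SQL restricts study_date explicitly):

```python
from datetime import date

# Sample rows from middle_active_learning: (user_id, study_date)
rows = [
    ("u001", "2021-04-01"), ("u002", "2021-04-01"), ("u003", "2021-04-03"),
    ("u001", "2021-04-06"), ("u003", "2021-04-07"), ("u001", "2021-04-12"),
    ("u001", "2021-04-13"), ("u002", "2021-04-14"), ("u001", "2021-04-23"),
    ("u002", "2021-04-24"), ("u001", "2021-04-26"), ("u003", "2021-04-27"),
    ("u002", "2021-04-30"),
]

weeks_in_april = set()
user_weeks = {}
for user, d in rows:
    y, m, day = map(int, d.split("-"))
    week = date(y, m, day).isocalendar()[1]  # ISO week number, like WEEKOFYEAR()
    weeks_in_april.add(week)
    user_weeks.setdefault(user, set()).add(week)  # set() mirrors DISTINCT

# April 2021 spans ISO weeks 13-17, i.e. 5 distinct weeks
every_week = sorted(u for u, w in user_weeks.items() if len(w) == len(weeks_in_april))
print(every_week)  # ['u001']
```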

Topic 4: Product classification and sorting

There is a commodity classification table middle_commodity_classification, and the data of the middle_commodity_classification table is shown in the following table:

mysql> SELECT * FROM middle_commodity_classification;
-- current_category (current category): VARCHAR, parent_category (parent category): VARCHAR
+------------------+-----------------+
| current_category | parent_category |
+------------------+-----------------+
|| 厨具            |
| 厨具             | 生活用品        |
|| 餐具            |
| 水果刀           ||
| 剔骨刀           ||
| 餐具             | 生活用品        |
| 汤碗             ||
+------------------+-----------------+
7 rows in set (0.00 sec)

[Question 4] Write a query that produces the sample result below. The output columns are the third-level category, second-level category, first-level category, and root category; a result sample is shown in the figure below:
(result sample image omitted)
[Analysis of Question 4] This question is about reconstructing the hierarchy between categories. Since the sample result contains four category levels, it can be produced with a three-way self-join of the table, chaining each row's parent to the next level's current category. Knowledge points: self-joins. The reference code is as follows:

mysql> SELECT m1.current_category AS '三级类目',
    ->        m1.parent_category  AS '二级类目',
    ->        m2.parent_category  AS '一级类目',
    ->        m3.parent_category  AS '根目录'
    -> FROM middle_commodity_classification m1,
    ->      middle_commodity_classification m2,
    ->      middle_commodity_classification m3
    -> WHERE m1.parent_category = m2.current_category
    ->   AND m2.parent_category = m3.current_category;

Topic 5: Merchandise Sales Analysis

There is a commodity information table middle_commodity_info, which records the basic information of commodities, and the middle_commodity_info data is as follows:

mysql> SELECT * FROM middle_commodity_info;
-- sku_id (SKU): VARCHAR, commodity_category (commodity category): VARCHAR, director (sales director): VARCHAR
+--------+--------------------+----------+
| sku_id | commodity_category | director |
+--------+--------------------+----------+
| u001   | c001               | a001     |
| u003   | c002               | a001     |
| u002   | c003               | a002     |
+--------+--------------------+----------+
3 rows in set (0.00 sec)

There is also a commodity sales amount table middle_commodity_sale, which records the daily commodity sales. The middle_commodity_sale data is as follows:

mysql> SELECT * FROM middle_commodity_sale;
-- date (date): DATE, sku_id (SKU): VARCHAR, sales (sales amount): INT
+------------+--------+-------+
| date       | sku_id | sales |
+------------+--------+-------+
| 2020-12-20 | u001   | 12000 |
| 2020-12-20 | u002   |  8000 |
| 2020-12-20 | u003   | 11000 |
| 2020-12-21 | u001   | 20000 |
| 2020-12-21 | u003   | 16000 |
| 2020-12-22 | u003   | 11000 |
| 2020-12-22 | u001   | 34000 |
| 2020-12-22 | u002   | 11000 |
| 2020-12-23 | u003   | 18000 |
| 2020-12-23 | u001   | 30000 |
+------------+--------+-------+
10 rows in set (0.00 sec)

[Question 5] For the sales director a001, query the two days with the highest total sales for each commodity category in 2020. The output columns are commodity_category (commodity category), date (date), and total_sales (total sales); a result sample is shown in the figure below:
(result sample image omitted)
[Question 5] The reference code is as follows:

mysql> SELECT commodity_category
    ->      , `date`
    ->      , total_sales
    -> FROM (
    ->          SELECT commodity_category
    ->               , `date`
    ->               , RANK() OVER (PARTITION BY commodity_category ORDER BY total_sales DESC) AS ranking
    ->               , total_sales
    ->          FROM (
    ->                   SELECT b.commodity_category
    ->                        , a.`date`
    ->                        , SUM(a.sales) AS total_sales
    ->                   FROM middle_commodity_sale a
    ->                            JOIN middle_commodity_info b
    ->                                 ON a.sku_id = b.sku_id
    ->                   WHERE b.director = 'a001'
    ->                     AND YEAR(a.`date`) = 2020
    ->                   GROUP BY b.commodity_category
    ->                          , a.`date`
    ->               ) c
    ->      ) d
    -> WHERE ranking <= 2;
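
The two-step structure of the query — aggregate per (category, date) first, then rank within each category — can be sketched in plain Python as follows. This is an illustrative cross-check, not the reference answer, and it reproduces RANK()'s tie behavior (tied totals share a rank, so a tie at rank 2 would keep more than two rows):

```python
# Sample data: middle_commodity_info as sku_id -> (commodity_category, director)
info = {"u001": ("c001", "a001"), "u003": ("c002", "a001"), "u002": ("c003", "a002")}
# middle_commodity_sale: (date, sku_id, sales)
sales = [
    ("2020-12-20", "u001", 12000), ("2020-12-20", "u002", 8000),
    ("2020-12-20", "u003", 11000), ("2020-12-21", "u001", 20000),
    ("2020-12-21", "u003", 16000), ("2020-12-22", "u003", 11000),
    ("2020-12-22", "u001", 34000), ("2020-12-22", "u002", 11000),
    ("2020-12-23", "u003", 18000), ("2020-12-23", "u001", 30000),
]

# Step 1: JOIN + filter director = 'a001' and year 2020, GROUP BY (category, date)
totals = {}
for d, sku, amount in sales:
    category, director = info[sku]
    if director == "a001" and d.startswith("2020"):
        totals[(category, d)] = totals.get((category, d), 0) + amount

by_category = {}
for (category, d), total in totals.items():
    by_category.setdefault(category, []).append((total, d))

# Step 2: RANK() per category on total_sales DESC, keep ranking <= 2
top2 = {}
for category, day_totals in by_category.items():
    day_totals.sort(reverse=True)
    ranks = []
    for i, (total, d) in enumerate(day_totals):
        # RANK(): ties share the previous rank, otherwise rank = position
        rank = i + 1 if i == 0 or total != day_totals[i - 1][0] else ranks[-1][0]
        ranks.append((rank, d, total))
    top2[category] = [(d, total) for rank, d, total in ranks if rank <= 2]

print(top2)
```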

Topic 6: Revenue statistics of online car-hailing drivers

There is an online car-hailing order table middle_car_order, which records the ride-hailing orders of a single day. The middle_car_order data is shown in the following table:

mysql> SELECT * FROM middle_car_order;
-- order_id (order ID): VARCHAR, driver_id (driver ID): VARCHAR, order_amount (order amount): DOUBLE
+----------+-----------+--------------+
| order_id | driver_id | order_amount |
+----------+-----------+--------------+
| o001     | d001      |         15.6 |
| o002     | d002      |         36.5 |
| o003     | d001      |         30.1 |
| o004     | d002      |         10.6 |
| o005     | d001      |         26.2 |
| o006     | d001      |         14.6 |
| o007     | d003      |         28.9 |
| o008     | d001      |          8.8 |
| o009     | d002      |         13.3 |
| o010     | d001      |         29.4 |
+----------+-----------+--------------+
10 rows in set (0.00 sec)

[Question 6] A driver's income is 80% of the order amount (order amounts in the table are in yuan). A driver who completes at least 5 orders in the day with a total order amount of at least 100 yuan receives an extra 10-yuan subsidy. Compute each driver's income for the day, sort the results by income in descending order, and round to two decimal places. The output columns are driver_id (driver ID), total_order (total number of orders), and total_income (total income); a result sample is shown in the figure below:
(result sample image omitted)
[Question 6] The reference code is as follows:

mysql> SELECT a.driver_id,
    ->        a.total_order,
    ->        CASE
    ->            WHEN total_order >= 5 AND total_amount >= 100 THEN ROUND(total_amount * 0.8 + 10, 2)
    ->            ELSE ROUND(total_amount * 0.8, 2) END AS 'total_income'
    -> FROM (SELECT driver_id, COUNT(driver_id) AS 'total_order', SUM(order_amount) AS 'total_amount'
    ->       FROM middle_car_order
    ->       GROUP BY driver_id) a ORDER BY total_income DESC;
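
As a sanity check on the subsidy rule (an illustration, separate from the reference answer), the sketch below recomputes each driver's income from the sample orders in plain Python:

```python
# Sample rows from middle_car_order: (order_id, driver_id, order_amount)
orders = [
    ("o001", "d001", 15.6), ("o002", "d002", 36.5), ("o003", "d001", 30.1),
    ("o004", "d002", 10.6), ("o005", "d001", 26.2), ("o006", "d001", 14.6),
    ("o007", "d003", 28.9), ("o008", "d001", 8.8), ("o009", "d002", 13.3),
    ("o010", "d001", 29.4),
]

# GROUP BY driver_id: count orders and sum amounts
stats = {}
for _, driver, amount in orders:
    count, total = stats.get(driver, (0, 0.0))
    stats[driver] = (count + 1, total + amount)

result = []
for driver, (total_order, total_amount) in stats.items():
    income = total_amount * 0.8
    if total_order >= 5 and total_amount >= 100:  # subsidy condition
        income += 10
    result.append((driver, total_order, round(income, 2)))
result.sort(key=lambda r: r[2], reverse=True)  # ORDER BY total_income DESC

print(result)
```

Only d001 (6 orders totaling 124.7 yuan) qualifies for the 10-yuan subsidy.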

Topic 7: Website login time interval statistics

There is a website login table middle_login_info, which records the website login information of all users. The data of the middle_login_info table is as follows:

mysql> SELECT * FROM  middle_login_info;
-- user_id (user ID): VARCHAR, login_time (login date): DATE
+---------+------------+
| user_id | login_time |
+---------+------------+
| a001    | 2021-01-01 |
| b001    | 2021-01-01 |
| a001    | 2021-01-03 |
| a001    | 2021-01-06 |
| a001    | 2021-01-07 |
| b001    | 2021-01-07 |
| a001    | 2021-01-08 |
| a001    | 2021-01-09 |
| b001    | 2021-01-09 |
| b001    | 2021-01-10 |
| b001    | 2021-01-15 |
| a001    | 2021-01-16 |
| a001    | 2021-01-18 |
| a001    | 2021-01-19 |
| b001    | 2021-01-20 |
| a001    | 2021-01-23 |
+---------+------------+
16 rows in set (0.00 sec)

[Question 7] Count, for each user, the number of times the gap between consecutive login dates is less than 5 days. The output columns are user_id (user ID) and num (the number of such gaps); a result sample is shown in the figure below:
(result sample image omitted)
[Question 7] The reference code is as follows:

mysql> SELECT a.user_id, COUNT(*) AS 'num'
    -> FROM (SELECT user_id,
    ->              login_time,
    ->              TIMESTAMPDIFF(DAY, LAG(login_time) OVER (PARTITION BY user_id ORDER BY login_time),
    ->                            login_time) AS date_diff
    ->       FROM middle_login_info) a
    -> WHERE a.date_diff < 5
    -> GROUP BY a.user_id;
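
The LAG()-based gap counting can be cross-checked in plain Python on the sample rows (again, an illustration rather than the reference answer):

```python
from datetime import date

# Sample rows from middle_login_info: (user_id, login_time)
logins = [
    ("a001", "2021-01-01"), ("b001", "2021-01-01"), ("a001", "2021-01-03"),
    ("a001", "2021-01-06"), ("a001", "2021-01-07"), ("b001", "2021-01-07"),
    ("a001", "2021-01-08"), ("a001", "2021-01-09"), ("b001", "2021-01-09"),
    ("b001", "2021-01-10"), ("b001", "2021-01-15"), ("a001", "2021-01-16"),
    ("a001", "2021-01-18"), ("a001", "2021-01-19"), ("b001", "2021-01-20"),
    ("a001", "2021-01-23"),
]

by_user = {}
for user, d in logins:
    by_user.setdefault(user, []).append(date.fromisoformat(d))

num = {}
for user, dates in by_user.items():
    dates.sort()  # mirrors ORDER BY login_time inside the LAG() window
    # Day gaps between consecutive logins; count those under 5 days
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    num[user] = sum(1 for g in gaps if g < 5)

print(num)  # {'a001': 8, 'b001': 2}
```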

Topic 8: Statistics of Commodity Revenue in Different Regions

There is a table of commodity income in different cities, middle_sale_volume, which records the year, region, city, and income. The middle_sale_volume data is shown in the following table:

mysql> SELECT * FROM middle_sale_volume;
-- year (year): YEAR, region (region): VARCHAR, city (city): VARCHAR, money (income): INT
+------+--------+------+-------+
| year | region | city | money |
+------+--------+------+-------+
| 2018 | 东区   | A 市 |  1125 |
| 2019 | 东区   | A 市 |  1305 |
| 2020 | 东区   | A 市 |  1623 |
| 2018 | 东区   | C 市 |   845 |
| 2019 | 东区   | C 市 |   986 |
| 2020 | 东区   | C 市 |  1134 |
| 2018 | 西区   | M 市 |   638 |
| 2019 | 西区   | M 市 |  1490 |
| 2020 | 西区   | M 市 |  1120 |
| 2018 | 西区   | V 市 |  1402 |
| 2019 | 西区   | V 市 |  1209 |
| 2020 | 西区   | V 市 |  1190 |
+------+--------+------+-------+
12 rows in set (0.00 sec)

[Question 8] Compute each region's total income and average income per year, rounding results to one decimal place. The output columns are year (year) plus the total income and average income of each region; a result sample is shown in the figure below:
(result sample image omitted)
[Question 8] The reference code is as follows:

-- Approach ①
mysql> SELECT a.`year`
    ->      , ROUND(SUM(IF(a.region = '东区', a.money, 0)), 1)
    ->     AS '东区总收入'
    ->      , ROUND(SUM(IF(a.region = '西区', a.money, 0)), 1)
    ->     AS '西区总收入'
    ->      , ROUND(SUM(IF(a.region = '东区', a.money, 0)) / SUM(a.east_area), 1)
    ->     AS '东区平均收入'
    ->      , ROUND(SUM(IF(a.region = '西区', a.money, 0)) / SUM(a.west_area), 1)
    ->     AS '西区平均收入'
    -> FROM (
    ->          SELECT `year`
    ->               , region
    ->               , money
    ->               , IF(region = '东区', 1, 0) AS east_area
    ->               , IF(region = '西区', 1, 0) AS west_area
    ->          FROM middle_sale_volume
    ->          GROUP BY `year`
    ->                 , region
    ->                 , money
    ->      ) AS a
    -> GROUP BY a.`year`;
-- Approach ②
mysql> SELECT a.year,
    ->        ROUND(a.收入, 1)   AS '东区总收入',
    ->        ROUND(b.收入, 1)   AS '西区总收入',
    ->        ROUND(a.平均收入, 1) AS '东区平均收入',
    ->        ROUND(b.平均收入, 1) AS '西区平均收入'
    -> FROM (SELECT year,
    ->              region,
    ->              SUM(money) AS '收入',
    ->              AVG(money) AS '平均收入'
    ->       FROM middle_sale_volume
    ->       GROUP BY year, region) a
    ->          INNER JOIN (SELECT year,
    ->                             region,
    ->                             SUM(money) AS '收入',
    ->                             AVG(money)    '平均收入'
    ->                      FROM middle_sale_volume
    ->                      GROUP BY year, region) b ON a.region < b.region AND a.year = b.year;
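
Both SQL approaches produce one row per year with a total/average column pair per region. As an illustrative cross-check (not the reference answer), the pivot can be replayed in plain Python on the sample rows:

```python
# Sample rows from middle_sale_volume: (year, region, city, money)
rows = [
    (2018, "东区", "A 市", 1125), (2019, "东区", "A 市", 1305), (2020, "东区", "A 市", 1623),
    (2018, "东区", "C 市", 845), (2019, "东区", "C 市", 986), (2020, "东区", "C 市", 1134),
    (2018, "西区", "M 市", 638), (2019, "西区", "M 市", 1490), (2020, "西区", "M 市", 1120),
    (2018, "西区", "V 市", 1402), (2019, "西区", "V 市", 1209), (2020, "西区", "V 市", 1190),
]

# Pivot: one output row per year, one (total, average) pair per region
pivot = {}
for year, region, _, money in rows:
    pivot.setdefault(year, {}).setdefault(region, []).append(money)

report = {
    year: {region: (round(sum(v), 1), round(sum(v) / len(v), 1))
           for region, v in regions.items()}
    for year, regions in pivot.items()
}

print(report[2018])  # {'东区': (1970, 985.0), '西区': (2040, 1020.0)}
```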

Topic 9: Statistics on overdue credit

There is a user loan situation table middle_credit_overdue, and the data in the middle_credit_overdue table is as follows:

mysql> SELECT * FROM middle_credit_overdue;
-- user_id (user ID): VARCHAR, overdue_date (loan overdue date): DATE
+---------+--------------+
| user_id | overdue_date |
+---------+--------------+
| u001    | 2020-10-20   |
| u002    | 2020-11-03   |
| u003    | 2020-10-04   |
| u004    | 2021-01-05   |
| u005    | 2021-01-15   |
| u006    | 2020-09-04   |
| u007    | 2021-01-03   |
| u008    | 2020-12-24   |
| u009    | 2020-12-10   |
+---------+--------------+
9 rows in set (0.00 sec)

[Question 9] With a statistics cutoff date of January 20, 2021, count, for each overdue month, the number of samples overdue 1-29 days, 30-59 days, and 60 days or more. The output columns are overdue_month (overdue month) and the counts for each of the three overdue ranges; a result sample is shown in the figure below:
(result sample image omitted)
[Question 9] The reference code is as follows:

-- Approach ① (for reference):
mysql> SELECT LEFT(overdue_date, 7) AS overdue_month,
    ->        SUM(CASE
    ->                WHEN TIMESTAMPDIFF(DAY, overdue_date, '2021-01-20') BETWEEN 1 AND 29 THEN 1
    ->                ELSE 0 END) AS '逾期1-29天',
    ->        SUM(CASE
    ->                WHEN TIMESTAMPDIFF(DAY, overdue_date, '2021-01-20') BETWEEN 30 AND 59 THEN 1
    ->                ELSE 0 END) AS '逾期30-59天',
    ->        SUM(CASE
    ->                WHEN TIMESTAMPDIFF(DAY, overdue_date, '2021-01-20') >= 60 THEN 1
    ->                ELSE 0 END) AS '逾期60天以上'
    -> FROM middle_credit_overdue
    -> GROUP BY LEFT(overdue_date, 7)
    -> ORDER BY LEFT(overdue_date, 7) DESC;
-- Approach ② (for reference):
mysql> SELECT overdue_month
    ->      , COUNT(CASE
    ->                  WHEN overdue_days >= 1 AND overdue_days < 30
    ->                      THEN user_id END)
    ->     AS '逾期 1-29 天'
    ->      , COUNT(CASE
    ->                  WHEN overdue_days >= 30 AND overdue_days < 60
    ->                      THEN user_id END)
    ->     AS '逾期 30-59 天'
    ->      , COUNT(CASE
    ->                  WHEN overdue_days >= 60
    ->                      THEN user_id END)
    ->     AS '逾期 60 天以上'
    -> FROM (
    ->          SELECT user_id
    ->               , DATE_FORMAT(overdue_date, '%Y-%m') AS overdue_month
    ->               , DATEDIFF('2021-01-20', overdue_date)
    ->                                                    AS overdue_days
    ->          FROM middle_credit_overdue
    ->      ) a
    -> GROUP BY overdue_month
    -> ORDER BY overdue_month DESC;
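
The bucketing logic — DATEDIFF against the cutoff date, grouped by the year-month prefix of overdue_date — can be replayed in plain Python on the sample rows. This illustrative sketch uses the inclusive >= 60 boundary from approach ②, so that exactly-60-day samples fall into a bucket:

```python
from datetime import date

AS_OF = date(2021, 1, 20)  # statistics cutoff date
# Sample rows from middle_credit_overdue: (user_id, overdue_date)
rows = [
    ("u001", "2020-10-20"), ("u002", "2020-11-03"), ("u003", "2020-10-04"),
    ("u004", "2021-01-05"), ("u005", "2021-01-15"), ("u006", "2020-09-04"),
    ("u007", "2021-01-03"), ("u008", "2020-12-24"), ("u009", "2020-12-10"),
]

buckets = {}  # overdue_month -> [1-29 days, 30-59 days, 60+ days]
for _, d in rows:
    days = (AS_OF - date.fromisoformat(d)).days  # DATEDIFF('2021-01-20', overdue_date)
    month = d[:7]  # LEFT(overdue_date, 7) / DATE_FORMAT(..., '%Y-%m')
    counts = buckets.setdefault(month, [0, 0, 0])
    if 1 <= days < 30:
        counts[0] += 1
    elif 30 <= days < 60:
        counts[1] += 1
    elif days >= 60:
        counts[2] += 1

for month in sorted(buckets, reverse=True):  # ORDER BY overdue_month DESC
    print(month, buckets[month])
```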

That concludes today's study. The author writes these articles only for learning and exchange, and to help readers studying databases avoid detours and save time; they are not used for any other purpose. In case of any infringement, contact the blogger and the content will be removed. Thank you for reading this post; I hope it can be a guide on your programming journey. Happy reading!



    A good book is never tiresome even after a hundred readings; mastery comes only with thorough familiarity. If I want to be the best version of myself, I must keep acquiring knowledge through study, change my destiny with knowledge, witness my growth through blogging, and prove my effort through action.
    If my blog helps you and you like its content, please like, comment, and bookmark it! I hear that those who do will never have bad luck and will be full of energy every day! And if you would rather just read for free, then I wish you happiness every day; feel free to visit my blog often.
 Writing is not easy, and everyone's support is what keeps me going. Don't forget to follow me after you like the post!

Origin blog.csdn.net/xw1680/article/details/130592344