【Topic】
The "Order Information Form" records data on Brazilian passengers using a ride-hailing app, including the call, response, cancellation, and completion time of each order. (Didi written-test question)
Note:
(1) The times in the table are in Beijing time, and Brazil is 11 hours behind China.
(2) If the value in the response time column is "1970", no driver responded to the order and it is an invalid order.
Questions:
1. What are the order response rate and order completion rate?
2. What is the average call response time?
3. From this week's data, which hour (local time) has the highest call volume? Which hour (local time) has the lowest call volume?
4. What proportion of call orders are followed by another call the next day?
5. (Optional) If you want to classify passengers, what factors do you think need to be considered?
【Problem solving steps】
We first preprocess the data to convert Beijing time to Brazil time. This takes two steps: first, we normalize the date format so that every time in the table is in a standard format; then we convert the normalized times to Brazil time.
(1) Date formatting
Date formatting modifies the date data in the table, so we use the update statement. The modification involves converting between date data types, so we use the cast function.
The times in the table should be in datetime format, accurate to the second (YYYY-MM-DD HH:mm:ss). The effect after conversion is shown in the figure below.
The following SQL statements do this.
update 订单信息表 set call_time=cast(call_time as datetime);
update 订单信息表 set grab_time=cast(grab_time as datetime);
update 订单信息表 set cancel_time=cast(cancel_time as datetime);
update 订单信息表 set finish_time=cast(finish_time as datetime);
The date formatted table is as shown below.
(2) Convert to Brazil time
Since the times in the data are in Beijing time and Brazil is 11 hours behind China, we subtract 11 hours from each time column with the date_sub function.
So we can write the following SQL statements:
update 订单信息表
set call_time = date_sub(call_time, interval 11 hour);
update 订单信息表
set grab_time = date_sub(grab_time, interval 11 hour);
update 订单信息表
set cancel_time = date_sub(cancel_time, interval 11 hour);
update 订单信息表
set finish_time = date_sub(finish_time, interval 11 hour);
The time conversion result is as follows:
With the above operations, the date preprocessing is complete.
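As an alternative sketch, the four conversions can be combined into a single update statement. This assumes MySQL and the same column names as above. Like the separate statements, it is not idempotent: running it a second time would shift every time by another 11 hours, so it should be run once in place of the four statements, not in addition to them.

```sql
-- Shift all four time columns from Beijing time to Brazil time in one pass.
-- Run this only once: each run subtracts another 11 hours.
update 订单信息表
set call_time   = date_sub(call_time,   interval 11 hour),
    grab_time   = date_sub(grab_time,   interval 11 hour),
    cancel_time = date_sub(cancel_time, interval 11 hour),
    finish_time = date_sub(finish_time, interval 11 hour);
```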
1. What are the order response rate and order completion rate?
(1) Response rate
Response rate = Number of answered orders/Number of called orders
Call orders: the number of call orders equals the number of values in the call time (call_time) column, which can be counted with count(call_time).
Answered orders: an order counts as answered only if its response time (grab_time) is not the '1970' placeholder. As shown in the figure below, the part in the red box is the answered orders.
The count therefore depends on a condition. As "Monkey Learns SQL from Zero" explains, conditional logic is expressed with the case when expression. One caveat: because preprocessing shifted every time back 11 hours, a placeholder such as 1970-01-01 00:00:00 now falls in 1969, so we test for a year later than 1970 instead of comparing the column with 1970 directly. The SQL for the number of answered orders is:
sum(case when year(grab_time) > 1970 then 1 else 0 end)
Now we can calculate the response rate = number of answered orders / number of call orders:
select sum(case when year(grab_time) > 1970 then 1 else 0 end)/count(call_time) as 应答率
from 订单信息表;
The query results are as follows:
(2) Order Completion Rate
Order completion rate = number of completed orders / number of call orders
Completed orders: an order counts as completed only if its finish time (finish_time) is not the '1970' placeholder. As shown in the figure below, the part in the red box is the completed orders.
Because preprocessing shifted every time back 11 hours, a placeholder such as 1970-01-01 00:00:00 now falls in 1969, so we test for a year later than 1970. The number of completed orders is:
sum(case when year(finish_time) > 1970 then 1 else 0 end)
Now we can calculate the completion rate = number of completed orders / number of call orders:
select sum(case when year(finish_time) > 1970 then 1 else 0 end)/count(call_time) as 完单率
from 订单信息表;
The query results are as follows
2. What is the average call response time?
According to the definition of the metric in the question:
Average call response time = total time from call to response for answered orders / number of answered orders
Time from call to response for an answered order = response time (grab_time) - call time (call_time).
This involves calculating the difference between two dates. As "Monkey Learns SQL from Zero" mentions, the function for this is timestampdiff. The figure below shows how this function is used.
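For reference, timestampdiff takes a unit and two date or datetime expressions, and returns the second minus the first in whole units. A small illustration with made-up literal values (not taken from the order table):

```sql
-- timestampdiff(unit, start, end) returns end - start, truncated to whole units
select timestampdiff(minute, '2021-03-01 08:00:00', '2021-03-01 08:05:30') as diff_minutes;  -- 5
select timestampdiff(day, '2021-03-01', '2021-03-03') as diff_days;  -- 2
```

Note that the earlier time goes first; swapping the two arguments gives a negative result.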
Returning to the question, we use the timestampdiff function to calculate the total time from call to response.
Putting this together, the analysis of the corresponding SQL statement is as follows
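A minimal sketch of such a statement, assuming the result is wanted in minutes (the unit is a choice, not stated in the question) and that unanswered orders carry the 1970 placeholder, which falls in 1969 after the 11-hour shift. The alias avg_response_minutes is illustrative.

```sql
-- Average minutes from call to response, over answered orders only.
-- The where clause drops placeholder rows so they distort neither the sum nor the count.
select sum(timestampdiff(minute, call_time, grab_time)) / count(*) as avg_response_minutes
from 订单信息表
where year(grab_time) > 1970;
```

Filtering in the where clause keeps the numerator and denominator on the same set of rows; using second as the unit and dividing by 60 would give a finer-grained average.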
The query results are as follows
3. From this week's data, which hour (local time) has the highest call volume? Which hour (local time) has the lowest call volume?
(1) Time conversion
Since the question asks "which hour", we first extract the hour from each call time. We add a new column named call_time_hour to hold the hour.
-- Add the column
alter table 订单信息表 add column call_time_hour varchar(255);
The date_format function displays date data in different formats; here we use it to extract the hour.
/**
Populate the column.
%k displays the hour in 24-hour format.
*/
update 订单信息表
set call_time_hour=date_format(call_time,'%k');
The converted table is as shown below
(2) Which hour has the highest call volume?
Each call order is identified by the order_id column. Group by hour (group by call_time_hour), count the call orders in each hour with count(order_id), and then sort the result to see which hour has the highest order volume.
The following figure shows the SQL statement analysis process:
At this time, the query results are as follows
Because the question asks for the maximum after sorting (the hour with the highest call volume), we can use the limit clause to keep only the first row.
The SQL statement is as follows:
select call_time_hour,count(order_id) as 最大次数
from 订单信息表
group by call_time_hour
order by 最大次数 desc
limit 1;
(3) Which hour has the lowest call volume?
From the sorted results above, three hours tie for the minimum call count, so we sort in ascending order and use limit 3 to return all of them.
select call_time_hour,count(order_id) as 最小次数
from 订单信息表
group by call_time_hour
order by 最小次数 asc
limit 3;
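The limit 3 above relies on having inspected the sorted output first. As a hedged alternative that does not hard-code the number of ties, the minimum count can be computed in a subquery and matched in a having clause. The aliases call_count, cnt, and hourly are illustrative.

```sql
-- Return every hour whose call count equals the smallest hourly count,
-- however many hours are tied.
select call_time_hour, count(order_id) as call_count
from 订单信息表
group by call_time_hour
having count(order_id) = (
    select min(cnt)
    from (
        select count(order_id) as cnt
        from 订单信息表
        group by call_time_hour
    ) as hourly
);
```

Both versions only see hours that appear in the data; an hour with no calls at all has no rows and is not reported.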
4. What is the proportion of call orders that continue to call the next day?
Proportion of call orders followed by a call the next day = number of users who call again the next day / total number of call orders.
The idea for calculating the number of users who call again the next day is as follows:
Let's analyze each part in detail.
(1) Use a self-join to obtain the interval between calls. Since we need the interval in days, we first use the date_format function to extract the "year-month-day" part of the date.
The SQL statement is as follows:
-- Add a column that holds the "year-month-day" part of the time
alter table 订单信息表 add column call_time_day varchar(255);
update 订单信息表
set call_time_day=date_format(call_time,'%Y-%m-%d');
The changed table at this time is as follows:
We then join the table to itself and calculate the number of days between calls. Since this involves a difference in days, we use the timestampdiff function mentioned above, with day as the unit.
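One way to sketch this self-join, assuming the table identifies passengers with a column named passenger_id (a hypothetical name; the column is not shown in this text). The aliases first_day, later_day, and day_gap are also illustrative.

```sql
-- Pair each order (a) with every later-day order (b) from the same passenger
-- and compute the gap in days. Orders with no later call keep NULLs for b.
select a.order_id,
       a.passenger_id,
       a.call_time_day as first_day,
       b.call_time_day as later_day,
       timestampdiff(day, a.call_time_day, b.call_time_day) as day_gap
from 订单信息表 as a
left join 订单信息表 as b
  on a.passenger_id = b.passenger_id
 and b.call_time_day > a.call_time_day;
```

A left join is used so that orders without any later call still appear in the result, which matters when they are counted in the denominator later.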
At this time, the query results are as follows
Keep only the rows with a time difference of 1 day, that is, rows with interval = 1.
Using a nested subquery, treat the above query result as a new table, filter within it, and sum. The SQL statement analysis is shown in the figure below.
At this time, the query results are as follows
Finally, we calculate the proportion of continuing calls the next day
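A compact sketch of the final calculation, under the same assumption of a hypothetical passenger_id column and using the call_time_day column added earlier. It counts an order in the numerator when the same passenger called again exactly one day later, which is one reasonable reading of the metric; the alias next_day_call_ratio is illustrative.

```sql
-- Numerator: distinct orders that have at least one call from the same
-- passenger exactly one day later. Denominator: all distinct orders.
select count(distinct case when b.order_id is not null then a.order_id end)
       / count(distinct a.order_id) as next_day_call_ratio
from 订单信息表 as a
left join 订单信息表 as b
  on a.passenger_id = b.passenger_id
 and timestampdiff(day, a.call_time_day, b.call_time_day) = 1;
```

count(distinct ...) guards against double counting when a passenger places several orders on the following day.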
The query results are as shown in the figure below
5. (Optional) If you want to classify the passengers in the table, what factors do you think need to be considered?
We can consider user classification from the following two perspectives.
User Behavior Classification
1) From the completion time and the order response time, we can roughly calculate how long each ride took and classify rides as long-, mid-, or short-distance to analyze passengers' riding habits.
2) From the call time, we can see when the passenger placed the order and infer how the demand arose and in which scenarios users need to travel, such as commuting to work, commuting home, dining, traveling, or ad hoc trips.
User Value Classification
Use the RFM analysis method learned before to classify users by value.
Specific to this question, RFM can be defined as follows:
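R: the time of the passenger's most recent completed order.
F: how frequently the passenger hails rides.
M: the amount spent on rides. Here, the time spent riding can be used as a substitute.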
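The three RFM inputs could be gathered per passenger along these lines. This is a sketch that assumes a hypothetical passenger_id column, counts only completed orders for F, and uses minutes between response and completion as the stand-in for M; the aliases are illustrative.

```sql
-- One row per passenger with raw R, F, and M values (completed orders only).
select passenger_id,
       max(finish_time) as last_finish_time,                               -- R: most recent completion
       count(order_id) as completed_orders,                                -- F: number of completed rides
       sum(timestampdiff(minute, grab_time, finish_time)) as ride_minutes  -- M: ride time as a spending proxy
from 订单信息表
where year(finish_time) > 1970
group by passenger_id;
```

These raw values would still need to be scored (for example, against their averages) before passengers can be assigned to value segments.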
【Test points for this question】
1. Handling date data: master the common date processing methods used in this question.
2. Analytical thinking: apply the framework you have learned for solving problems with data analysis.