How good are you at Pandas? Try this problem and find out


Introduction: I recently ran into a practical data-processing problem at work. Drawing on the 200+ LeetCode problems I have solved and about a year of regular Pandas use, I finished it quickly. Here is a summary for future reference!

Problem description: We are given the start and end times of multiple behaviors for a group of users. Adjacent behaviors may overlap (that is, the start time of the next behavior may be earlier than the end time of the previous one), so for each user ID the overlapping start and end times need to be merged. Without loss of generality, the simulated sample data is as follows:

picture

In the sample data above, some of the behaviors of user A and of user B overlap in time. For example, the two behaviors of user A span [3, 6] and [4, 7] (note also that these two rows are not listed in order of start time). They overlap, so they can be merged into [3, 7]. Similarly, the two behaviors of user B span [4, 7] and [6, 8], which can be merged into [4, 8].

This small requirement can be broken down into two subproblems:

  • Given the start and end times of multiple behaviors of a single user, merge the intervals that overlap. This is in fact a classic LeetCode problem

picture

The picture above is a screenshot of LeetCode problem 56 (Merge Intervals)

  • Once interval merging works for a single user, extend it to multiple users and stitch the results together. In Pandas terms, this is naturally a groupby process: split, apply (range combine), combine

The first subproblem is not difficult and can be solved directly with a custom function. The sample code is as follows. The function assumes that starts is already sorted in ascending order, which is easy to guarantee in pandas.

def range_combine(starts, ends):
    # Merge intervals, assuming starts is sorted in ascending order
    combines = []
    for start, end in zip(starts, ends):
        if not combines or start > combines[-1][1]:
            # No overlap with the last merged interval: open a new one
            combines.append([start, end])
        else:
            # Overlap: extend the end of the last merged interval
            combines[-1][1] = max(combines[-1][1], end)
    return combines

# Test case
starts = [1, 3, 4, 8]
ends = [2, 6, 7, 9]
range_combine(starts, ends)
# Output: [[1, 2], [3, 7], [8, 9]]

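The sorting precondition can be met in pandas with sort_values before the function is called. The following is only a minimal sketch: the column names start and end and the unsorted data are made up for illustration.

```python
import pandas as pd


def range_combine(starts, ends):
    # Merge intervals, assuming starts is sorted in ascending order
    combines = []
    for start, end in zip(starts, ends):
        if not combines or start > combines[-1][1]:
            combines.append([start, end])
        else:
            combines[-1][1] = max(combines[-1][1], end)
    return combines


# Hypothetical, deliberately unsorted intervals (not the article's data)
df = pd.DataFrame({"start": [4, 1, 8, 3], "end": [7, 2, 9, 6]})

# Sort by start time first so the function's precondition holds
df_sorted = df.sort_values("start")
merged = range_combine(df_sorted["start"], df_sorted["end"])
print(merged)  # [[1, 2], [3, 7], [8, 9]]
```

Without the sort, [4, 7] would be seen before [3, 6] and the merged interval would start at 4 instead of 3.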

The second subproblem takes a little more technique. To merge intervals per user, we must first call groupby('uid') and then run range_combine on each group, which yields, for every user, a nested list of merged intervals. The problem then becomes: how do we split this nested list into multiple rows? This is where a handy Pandas API, explode, comes in: it expands a list-like value into multiple rows. As the explode documentation below shows, it takes one or more column names as its argument (the columns to expand). When a value in such a column is list-like, each element gets its own row, and the other values in that row are copied to every new row.

picture
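To make the behavior of explode concrete, here is a small sketch on a hand-built DataFrame. The column name ranges and the values are assumptions chosen for illustration, not the article's data.

```python
import pandas as pd

# Hypothetical frame: each cell in "ranges" holds a list of [start, end] pairs
df = pd.DataFrame({
    "uid": ["A", "B"],
    "ranges": [[[3, 7]], [[4, 8], [10, 12]]],
})

# explode expands only the outer list: one row per [start, end] pair.
# The uid value and the original index label are repeated for each new row.
exploded = df.explode("ranges")
print(exploded)
```

User B had two intervals, so B appears on two rows after the call, while each inner [start, end] pair stays intact as a list.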

With explode, each user's merged intervals can then be split into multiple rows. The implementation is as follows:

picture
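Since the original code for this step survives only as an image, the following is a sketch of how it might look, not necessarily the author's exact code. The column names uid, start and end are assumptions, and the sample rows are hypothetical: they follow the intervals described in the text, plus one extra non-overlapping interval for user A.

```python
import pandas as pd


def range_combine(starts, ends):
    # Merge intervals, assuming starts is sorted in ascending order
    combines = []
    for start, end in zip(starts, ends):
        if not combines or start > combines[-1][1]:
            combines.append([start, end])
        else:
            combines[-1][1] = max(combines[-1][1], end)
    return combines


# Hypothetical sample data; user A's first two rows are out of order
df = pd.DataFrame({
    "uid":   ["A", "A", "A", "B", "B"],
    "start": [4, 3, 10, 4, 6],
    "end":   [7, 6, 12, 7, 8],
})

# Sort within each user, then merge intervals group by group.
# The result is a Series indexed by uid whose values are nested lists:
# A -> [[3, 7], [10, 12]], B -> [[4, 8]]
combined = (
    df.sort_values(["uid", "start"])
      .groupby("uid")[["start", "end"]]
      .apply(lambda g: range_combine(g["start"], g["end"]))
)

# One row per merged interval; the uid index label is repeated as needed
exploded = combined.explode()
print(exploded)
```

Selecting the start and end columns before apply keeps the grouping column out of the frame passed to the lambda, which is a design choice of this sketch rather than something the article specifies.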

At this point most of the work is done. The last step is to split each merged interval into two columns holding the start and end times. This can be done by calling pd.Series directly and then renaming the resulting columns. Finally, here is the complete pandas implementation for this requirement:

picture
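The complete code is also only available as an image, so here is one possible end-to-end sketch under the same assumptions as before: hypothetical column names uid, start and end, and made-up sample rows that follow the intervals described in the text.

```python
import pandas as pd


def range_combine(starts, ends):
    # Merge intervals, assuming starts is sorted in ascending order
    combines = []
    for start, end in zip(starts, ends):
        if not combines or start > combines[-1][1]:
            combines.append([start, end])
        else:
            combines[-1][1] = max(combines[-1][1], end)
    return combines


# Hypothetical sample data (not the article's actual table)
df = pd.DataFrame({
    "uid":   ["A", "A", "A", "B", "B"],
    "start": [4, 3, 10, 4, 6],
    "end":   [7, 6, 12, 7, 8],
})

result = (
    df.sort_values(["uid", "start"])                  # satisfy the sorted precondition
      .groupby("uid")[["start", "end"]]               # split by user
      .apply(lambda g: range_combine(g["start"], g["end"]))  # merge per user
      .explode()                                      # one row per merged interval
      .apply(pd.Series)                               # [start, end] -> columns 0 and 1
      .rename(columns={0: "start", 1: "end"})         # give the columns names
      .reset_index()                                  # turn uid back into a column
)
print(result)
```

For this data the result has three rows: A with [3, 7] and [10, 12], and B with [4, 8].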

One practical requirement ends up touching several data-processing techniques. Real understanding does come from practice!


Origin juejin.im/post/7100944968101363726