Pandas basics | Merging and sorting time ranges in a user tour log

Author: Little Ming, a Pandas data processing specialist committed to helping data practitioners solve data processing problems.

Requirements

We have a user tour log in the following format (copy the table shown below to the clipboard, then run the following code to get the same result):

import pandas as pd

df = pd.read_clipboard()
df

result:

uid start end
0 A 1 2
1 A 4 7
2 A 3 6
3 A 8 9
4 B 2 3
5 B 4 7
6 B 10 11
7 B 6 8
8 B 12 15
9 C 14 15
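
If reading from the clipboard is inconvenient, the same table can also be built directly in code. This is only a sketch for reproducibility; the values are copied from the table above.

```python
import pandas as pd

# Same rows as the table above, built in code instead of read from the clipboard.
df = pd.DataFrame(
    {
        "uid": ["A", "A", "A", "A", "B", "B", "B", "B", "B", "C"],
        "start": [1, 4, 3, 8, 2, 4, 10, 6, 12, 14],
        "end": [2, 7, 6, 9, 3, 7, 11, 8, 15, 15],
    }
)
print(df)
```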

Here, uid identifies the user, start is the time the tour began, and end is the time the tour finished. The table shows that some tour times overlap. For example, user A has the records 3-6 and 4-7, which overlap and can be treated as a single tour time of 3-7.

The task is to merge each user's overlapping tour times and then display the result in chronological order.

Note: 3-4 and 4-6 also count as overlapping and can be merged into 3-6.
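
To make the notion of overlap concrete, here is a minimal sketch that treats each record as a closed interval, so that shared endpoints count as overlapping. The function name overlaps is hypothetical and is not part of the original solution.

```python
def overlaps(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


print(overlaps((3, 6), (4, 7)))  # True: the example for user A
print(overlaps((3, 4), (4, 6)))  # True: touching endpoints also count
print(overlaps((1, 2), (3, 6)))  # False: there is a gap between 2 and 3
```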

First, merge and sort the times for a single user

Take one user's data for a trial run:

tmp = df.groupby("uid").get_group('B')
tmp

result:

uid start end
4 B 2 3
5 B 4 7
6 B 10 11
7 B 6 8
8 B 12 15

Looking at the data, the first step in solving this problem is to sort the records by start time.

After sorting:

tmp = tmp.sort_values('start')
tmp

result:

uid start end
4 B 2 3
5 B 4 7
7 B 6 8
6 B 10 11
8 B 12 15

With the data sorted, the merge rule is easy to spot:

When the start time of the current record is less than or equal to the end time of the previous (already merged) record, the two are merged. This is simple to implement:

result = []
for uid, start, end in tmp.values:
    # If the result list is still empty, or the current record starts after
    # the previous record ends, append the current record as a new entry
    if not result or start > result[-1][2]:
        result.append([uid, start, end])
    else:
        # Otherwise, the current record overlaps the previous one and is merged:
        # the previous record's end time becomes the later of the two end times
        result[-1][2] = max(result[-1][2], end)
tmp = pd.DataFrame(result, columns=["uid", "start", "end"])
tmp

result:

uid start end
0 B 2 3
1 B 4 8
2 B 10 11
3 B 12 15
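
As an aside, the same rule can be expressed without an explicit Python loop. The following is a sketch of an alternative technique, not the approach used in this article: it compares each start time with the running maximum of all earlier end times, then numbers the merged blocks with a cumulative sum. The names prev_max_end, new_block and block_id are made up for illustration, and the data for user B is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

# User B's records, rebuilt here so the snippet is self-contained.
tmp = pd.DataFrame(
    {"uid": ["B"] * 5, "start": [2, 4, 10, 6, 12], "end": [3, 7, 11, 8, 15]},
    index=[4, 5, 6, 7, 8],
)
tmp = tmp.sort_values("start")

# Largest end time among all earlier rows (NaN for the first row).
prev_max_end = tmp["end"].cummax().shift()
# A new block begins whenever a record starts after everything before it has ended.
new_block = tmp["start"] > prev_max_end
# Cumulative sum turns the True/False flags into one label per merged block.
block_id = new_block.cumsum().rename("block")

merged = (
    tmp.groupby(block_id)
    .agg(uid=("uid", "first"), start=("start", "min"), end=("end", "max"))
    .reset_index(drop=True)
)
print(merged)
```

For five rows the loop is just as good; a vectorized form like this mainly matters when the log is large.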

Complete code

Now we put the complete processing code together:

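result = []
for uid, tmp in df.groupby("uid"):
    # Sort this user's records by start time
    tmp = tmp[["start", "end"]].sort_values('start')
    rows = []
    for start, end in tmp.values:
        # Start a new entry if there is no overlap with the previous one
        if not rows or start > rows[-1][2]:
            rows.append([uid, start, end])
        else:
            # Otherwise extend the previous entry's end time
            rows[-1][2] = max(rows[-1][2], end)
    tmp = pd.DataFrame(rows, columns=["uid", "start", "end"])
    result.append(tmp)
# Combine the per-user results into one table
result = pd.concat(result)
result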

result:

uid start end
0 A 1 2
1 A 3 7
2 A 8 9
0 B 2 3
1 B 4 8
2 B 10 11
3 B 12 15
0 C 14 15
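
Note that the index in the result above restarts at 0 for every user, because each per-user frame keeps its own index. If a single continuous index is preferred, one option is to collect all rows in one list and build the DataFrame once. The sketch below does that; the function name merge_user_intervals is hypothetical, and the sample data is rebuilt inline so the snippet is self-contained.

```python
import pandas as pd


def merge_user_intervals(df):
    """Merge overlapping (start, end) records per uid; hypothetical helper."""
    rows = []
    for uid, group in df.groupby("uid"):
        ordered = group[["start", "end"]].sort_values("start")
        for start, end in ordered.values:
            # Merge only into an entry that belongs to the same user.
            if rows and rows[-1][0] == uid and start <= rows[-1][2]:
                rows[-1][2] = max(rows[-1][2], end)
            else:
                rows.append([uid, start, end])
    # Building the frame once yields a continuous 0..n-1 index.
    return pd.DataFrame(rows, columns=["uid", "start", "end"])


df = pd.DataFrame(
    {
        "uid": ["A", "A", "A", "A", "B", "B", "B", "B", "B", "C"],
        "start": [1, 4, 3, 8, 2, 4, 10, 6, 12, 14],
        "end": [2, 7, 6, 9, 3, 7, 11, 8, 15, 15],
    }
)
result = merge_user_intervals(df)
print(result)
```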

And that's it, all done!

Origin blog.csdn.net/as604049322/article/details/112387087