Pandas groupby apply gets stuck at the end

I recently ran into this bug again. After a long process of elimination, I finally worked out what triggers it.

Symptom: when using groupby().apply(), the program hangs at the very end, just when the result should be produced.

Code that reproduces the problem:

import pandas as pd
from tqdm import tqdm


def func(df):
    # Add the previous row's salary (within the group) as a new column
    df['mean_sa'] = df['salary'].shift(1)
    return df

data = [[0, 'C', 43, 35]] * 3000 + \
       [[1, 'C', 18, 30]] * 3000 + \
       [[1, 'A', 20, 22]] * 3000

df_ori = pd.DataFrame(data, columns=['idx', 'company', 'salary', 'age'])
df_ori = df_ori.set_index('idx')
tqdm.pandas(desc='Get returns')
df_ori = df_ori.groupby('company').progress_apply(func)

Cause: the structure of the data.

When the input has many rows, and both the index and the column being grouped contain many repeated values, groupby().apply() hangs at the final step.
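To see whether a DataFrame matches the pattern described above, a quick check like the following may help. This is only a sketch: the variable names `has_dup_index` and `key_sorted` are my own, and the data is cut down to 3 rows per block so that it runs instantly.

```python
import pandas as pd

# Same layout as the example above, but only 3 rows per block
data = [[0, 'C', 43, 35]] * 3 + \
       [[1, 'C', 18, 30]] * 3 + \
       [[1, 'A', 20, 22]] * 3

df = pd.DataFrame(data, columns=['idx', 'company', 'salary', 'age'])
df = df.set_index('idx')

# True when the index contains repeated values
has_dup_index = df.index.has_duplicates

# True only when the grouping column is already in ascending order
key_sorted = df['company'].is_monotonic_increasing

print('duplicate index values:', has_dup_index)
print('grouping column sorted:', key_sorted)
```

If the first value is True and the second is False, the data has the same shape as the example that gets stuck.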

There are two simple solutions:

Method 1: Sort the data by the grouping column first. In this example that is the second column, company. Once the rows are sorted, everything works fine.

# Before sorting
data = [[0, 'C', 43, 35]] * 3000 + \
       [[1, 'C', 18, 30]] * 3000 + \
       [[1, 'A', 20, 22]] * 3000

# After sorting (same rows, ordered by company)
data = [[1, 'A', 20, 22]] * 3000 + \
       [[0, 'C', 43, 35]] * 3000 + \
       [[1, 'C', 18, 30]] * 3000
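With real data the rows cannot be reordered by hand, so the sort would normally be done on the DataFrame itself. Below is a minimal sketch using `sort_values`. The choice of `kind='stable'` is my own assumption, made so that rows within each company keep their original relative order, and the data is again cut down to 3 rows per block.

```python
import pandas as pd

data = [[0, 'C', 43, 35]] * 3 + \
       [[1, 'C', 18, 30]] * 3 + \
       [[1, 'A', 20, 22]] * 3

df = pd.DataFrame(data, columns=['idx', 'company', 'salary', 'age'])
df = df.set_index('idx')

# Sort by the grouping column before calling groupby().apply();
# a stable sort keeps the original row order inside each company
df_sorted = df.sort_values('company', kind='stable')

print(df_sorted['company'].is_monotonic_increasing)
```

After this step, `df_sorted.groupby('company')` sees the grouping column in order, which is the condition Method 1 relies on.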

Method 2: Return something different from the function. Only three lines need to change; they are marked with comments below.

import pandas as pd
from tqdm import tqdm


def func(df):
    mean_sa = df['salary'].shift(1)  # changed: compute a Series, do not modify df
    return mean_sa   # changed: return the Series

data = [[0, 'C', 43, 35]] * 3000 + \
       [[1, 'C', 18, 30]] * 3000 + \
       [[1, 'A', 20, 22]] * 3000

df_ori = pd.DataFrame(data, columns=['idx', 'company', 'salary', 'age'])
df_ori = df_ori.set_index('idx')
tqdm.pandas(desc='Get returns')
df_new = df_ori.groupby('company').progress_apply(func)  # changed: store the result in a new variable
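Method 2 returns a Series instead of the modified DataFrame, so the new column still has to be attached afterwards. For this particular function, one possible alternative is the built-in grouped `shift`, which skips apply altogether. This is a sketch and not part of the original fix: I have not confirmed that it avoids the hang on every pandas version, and the positional assignment through `to_numpy()` is my own choice to avoid aligning on the duplicated index.

```python
import pandas as pd

data = [[0, 'C', 43, 35]] * 3 + \
       [[1, 'C', 18, 30]] * 3 + \
       [[1, 'A', 20, 22]] * 3

df = pd.DataFrame(data, columns=['idx', 'company', 'salary', 'age'])
df = df.set_index('idx')

# Shift salary by one row inside each company; the result has one value
# per input row, in the same row order as df
shifted = df.groupby('company')['salary'].shift(1)

# Assign by position, since the index contains duplicates
df['mean_sa'] = shifted.to_numpy()

print(df)
```

The first row of each company gets NaN because there is no previous row in that group.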

Those are the two ways to fix a groupby().apply() that gets stuck. I hope this helps.

Origin blog.csdn.net/BeiErGeLaiDe/article/details/130643695