Filling missing values per category in pandas: a neat use of groupby and transform

First look at the data:

import numpy as np
import pandas as pd
# Columns: 身高 = height, 体重 = weight, 性别 = sex (男 = male, 女 = female)
x = pd.DataFrame([[166,52,'男'],[152,43,'女'],[182,73,'男'],[172,63,'女'],[np.nan,np.nan,'女'],[np.nan,np.nan,'男']],columns = ['身高','体重','性别'])
x

 

The usual way to fill missing values is with a column statistic such as the mean or the mode.
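A minimal sketch of that plain, ungrouped fill, using the same toy data. The names `df`, `num_cols`, `filled_mean` and `filled_mode` are chosen here for illustration and are not from the original post:

```python
import numpy as np
import pandas as pd

# Same toy data as above: 身高 = height, 体重 = weight, 性别 = sex
df = pd.DataFrame(
    [[166, 52, '男'], [152, 43, '女'], [182, 73, '男'],
     [172, 63, '女'], [np.nan, np.nan, '女'], [np.nan, np.nan, '男']],
    columns=['身高', '体重', '性别'],
)

num_cols = ['身高', '体重']  # illustrative name for the numeric columns

# Fill every numeric column with its overall mean, ignoring 性别
filled_mean = df.copy()
filled_mean[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Fill with the mode instead; mode() may return several rows, so take the first
filled_mode = df.copy()
filled_mode[num_cols] = df[num_cols].fillna(df[num_cols].mode().iloc[0])

print(filled_mean)
```

Both variants give every missing row the same value regardless of category, which is the limitation the rest of this post works around.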

However, filling each category with its own mean is more awkward. A common approach is to pull out the rows of each category and fill them one category at a time:

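labels = x['性别'].unique()
for label in labels:
    # every column except the last one (性别) is numeric
    for col in x.columns[:-1]:
        # values of this column for the current category only
        data_ = x.loc[x['性别']==label, col]
        x.loc[x['性别']==label, col] = data_.fillna(data_.mean())
print(x)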

 

But this is easier with groupby and transform:


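# Rebuild the data so the missing values are back
x = pd.DataFrame([[166,52,'男'],[152,43,'女'],[182,73,'男'],[172,63,'女'],[np.nan,np.nan,'女'],[np.nan,np.nan,'男']],columns = ['身高','体重','性别'])

# Within each 性别 group, fill each column's NaN with that group's mean
x.loc[:,x.columns != '性别'] = x.groupby('性别').transform(lambda s: s.fillna(s.mean()))
print(x)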

 

transform applies a function to each group and returns a result aligned with the rows of the original data. If the function reduces a group to a single scalar (the mean, for example), that scalar is broadcast to every row of the group, so all rows in the same group get the same value. If the function returns a result of the same size as the group, as fillna does here, each value goes back to its original row. So after grouping by 性别, each group is filled with its own mean and the result lines up with the original rows, which gives a different fill value for each category.
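The two cases can be seen side by side in a short sketch. The variable names (`df`, `num_cols`, `grouped`, `group_means`, `filled`, `filled_alt`) are illustrative, and the last step is an alternative formulation, not the method used in the post:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[166, 52, '男'], [152, 43, '女'], [182, 73, '男'],
     [172, 63, '女'], [np.nan, np.nan, '女'], [np.nan, np.nan, '男']],
    columns=['身高', '体重', '性别'],
)

num_cols = ['身高', '体重']
grouped = df.groupby('性别')[num_cols]

# Case 1: scalar aggregation. The group mean is broadcast to every row of
# the group, so the result has one row per original row.
group_means = grouped.transform('mean')

# Case 2: the function returns a result as long as the group, so each value
# is placed back at its original row.
filled = grouped.transform(lambda s: s.fillna(s.mean()))

# Alternative without a lambda: use the broadcast means as the fill values.
filled_alt = df[num_cols].fillna(group_means)

print(group_means)
print(filled)
```

Here `group_means` holds the per-category mean on every row, including the rows that were missing, while `filled` keeps the observed values and only replaces the NaN entries. `filled_alt` should match `filled` on this data.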

Origin blog.csdn.net/weixin_46707493/article/details/126740393