How to use a complex conditional to fill a column cell by cell in pandas in an efficient way?

Hartnäckig :

I have several old books where each page is filled with historical records of immigrants and their families. Most variables were filled in only for the father, usually regarded as the head of the family. So, for example, if the immigrant family was going to live in a city called "small city in the West", only the father would have this information, while the mother and children were assumed to be going to the same destination. Additionally, some observations have no information at all, even for the father.

What I want to do is fill in the missing values for the relatives within the same family (i.e., those with the same head). I have reached a solution, but it is too inefficient and I'm afraid I'm overcomplicating something that is rather simple. Below I use an example dataset to show my solution.

Example dataset:

import pandas as pd

m=1

test=pd.DataFrame({'destino_heranca':['A','','','','C','']*m, 'num_familia_raw':[1,1,2,2,3,3]*m}, index=range(6*m))

test

Note that individual 1 has city A as destination, since he or she is from family 1. On the other hand, family 2 must be marked as missing information in the final dataset, since I don't have information even for the head of the family.

  destino_heranca  num_familia_raw
0               A                1
1                                1
2                                2
3                                2
4               C                3
5                                3

Then I create a dictionary called isdest_null where the keys are the family numbers and the values are booleans: True if no one in the family has destination information (every value is an empty string) and False otherwise:

def num_familia_raw_dest(m):
    return list(set(test[test['num_familia_raw']==m].destino_heranca.values))

isdest_null={k:('' in num_familia_raw_dest(k)) & (len(num_familia_raw_dest(k))==1) for k in test.num_familia_raw.unique()}
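The same mapping can probably be built without calling the helper twice per family. The sketch below is an alternative, not the code from the question, and the name isdest_null_fast is hypothetical. It flags the empty cells and then asks, per family, whether all of them are empty.

```python
import pandas as pd

# Same example data as in the question (with m=1).
test = pd.DataFrame({
    'destino_heranca': ['A', '', '', '', 'C', ''],
    'num_familia_raw': [1, 1, 2, 2, 3, 3],
})

# True for a family only when every destino_heranca value is an empty string.
isdest_null_fast = (test['destino_heranca'].eq('')
                        .groupby(test['num_familia_raw'])
                        .all()
                        .to_dict())
print(isdest_null_fast)
```

This does a single grouped pass over the column instead of filtering the DataFrame once per family, which should matter more as the number of families grows.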

In a separate executable file called heritage.py I define the following function:

import numpy as np
def heritage(col, data, empty_map):
    for k in data.num_familia_raw.unique():
        if empty_map[k]:
            data[data.num_familia_raw==k]=data[data.num_familia_raw==k].replace({'{}_heranca'.format(col):{'':'nao informado'}})

    #information doesn't exist
    condition1=(data['{}_heranca'.format(col)]=='')
    #same family
    condition2=(data['num_familia_raw']==data['num_familia_raw'].shift(1))

    while '' in data.groupby('num_familia_raw').last()['{}_heranca'.format(col)].values:
        data['{}_heranca'.format(col)]=np.where(condition1 & condition2,data['{}_heranca'.format(col)].shift(1),data['{}_heranca'.format(col)])

    return data['{}_heranca'.format(col)]

Running the full code with the appropriate imports yields:

0                A
1                A
2    nao informado
3    nao informado
4                C
5                C

which is exactly what I want. However, this solution is hugely inefficient and my real data has almost 2 million rows.

Measuring performance with timeit

I'm trying to measure the performance of my implementation so I can compare it with other solutions I may develop, and I would be very grateful if someone could help me understand it better. Here is my code:

import timeit

timeit.timeit("heritage('destino', data=test, empty_map=isdest_null)",number=1000, globals=globals())

output:

23.539601539001524

I'm not sure how to interpret it. According to the documentation this means 23 seconds per loop, but what does this mean in my case?
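On reading that number: as far as the timeit documentation describes it, timeit.timeit returns the total time in seconds for all `number` executions, not the time per execution, so 23.54 s over 1000 runs is roughly 23.5 ms per call. Below is a minimal sketch of that arithmetic; work is a hypothetical stand-in for the heritage call and is not part of the original code.

```python
import timeit

def work():
    # Hypothetical stand-in for the call being measured.
    return sum(range(1000))

number = 1000
total = timeit.timeit(work, number=number)  # total seconds for all runs together
per_call = total / number                   # average seconds for a single run
print(f"total: {total:.4f} s, per call: {per_call * 1e3:.4f} ms")
```

One caveat, from reading the code rather than from a measurement: heritage assigns into the DataFrame it receives, so after the first of the 1000 runs test appears to be already filled and the later runs do less work. Passing data=test.copy() in the timed statement would avoid that, at the cost of timing the copy as well.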

Quang Hoang :

If the available destino_heranca always appears first in each num_familia_raw, then you can do a transform:

test['destino_heranca'] = (test.groupby('num_familia_raw')['destino_heranca']
                               .transform('first')
                               .replace('','nao informado')
                           )

Output:

  destino_heranca  num_familia_raw
0               A                1
1               A                1
2   nao informado                2
3   nao informado                2
4               C                3
5               C                3
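If the known value is not guaranteed to be on the first row of each family, one possible variant is to turn the empty strings into missing values first, because the groupby 'first' aggregation skips missing values by default. This is a sketch under that assumption; test2, known and filled are hypothetical names, and the data is a reordered version of the example.

```python
import pandas as pd

# Hypothetical variant of the example: the known value is not on the family's first row.
test2 = pd.DataFrame({
    'destino_heranca': ['', 'A', '', '', '', 'C'],
    'num_familia_raw': [1, 1, 2, 2, 3, 3],
})

# Keep non-empty values and turn empty strings into NaN so 'first' can skip them.
known = test2['destino_heranca'].where(test2['destino_heranca'] != '')

# First non-missing value per family, broadcast to every row of that family;
# families with no value at all stay missing and get the placeholder.
filled = (known.groupby(test2['num_familia_raw'])
               .transform('first')
               .fillna('nao informado'))
print(filled.tolist())
```

Families with several different non-empty values would all receive the first one, so this only fits data where each family has at most one distinct destination.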

Origin http://43.154.161.224:23101/article/api/json?id=220093&siteId=1