Move data from one row to another row within a group of specified rows

LivingstoneM :

I have a dataset that I want to transform; below is a sample of what it looks like. There is a column called Hospital that contains the same block of arm rows (five in the sample below), and this block repeats until the end of the dataframe. I want to transform the data so that everything is stored on the first row of each block, which is called prelim_arm_1, and the remaining arm rows are deleted.

import pandas as pd
import numpy as np

# Initialise the data as a dict of lists.
data = {'Hospital':['prelim_arm_1' , '24_hour_review_arm_1','48_hour_review_arm_1',
                    '72_hour_review_arm_1','discharge_informat_arm_1','prelim_arm_1' , 
                    '24_hour_review_arm_1','48_hour_review_arm_1',
                    '72_hour_review_arm_1','discharge_informat_arm_1'],
        'Bug_Hosp':['133', 'NAN' , 'NAN', 'NAN', 'NAN','133', 'NAN' , 'NAN', 'NAN', 'NAN'], 
        'code':['G45','NAN' ,'NAN','NAN', 'NAN', 'G45','NAN' ,'NAN','NAN', 'NAN'],
        'cont':['T256','NAN' ,'NAN','NAN', 'NAN','T256','NAN' ,'NAN','NAN', 'NAN'],
        'IPC':['NAN','NAN' ,'NAN','567TY', 'NAN','NAN','NAN' ,'NAN','567Tu', 'NAN'],
        'NO_CT':['NAN','NAN' ,'NAN','NAN', '5667','NAN','NAN' ,'NAN','3456', 'NAN'],
        } 

# Create DataFrame 
df_final = pd.DataFrame(data) 

# Print the output. 
print(df_final)


The final dataset should look like this (only the first group is shown):

import pandas as pd
import numpy as np

# Initialise the data as a dict of lists.
data = {'Hospital':['prelim_arm_1'],
        'Bug_Hosp':['133'], 'code':['G45'],
        'cont':['T256'],
        'IPC':['567TY'],
        'NO_CT':['5667']} 

# Create DataFrame 
df_final = pd.DataFrame(data) 

# Print the output. 
print(df_final)

The dataset is huge, with the arm rows repeated throughout. For each group of arms, I want to keep the data only on the prelim_arm_1 row and delete the other arm rows, so the final table has a single prelim_arm_1 row holding the data for each group.

jezrael :

If you want the first non-missing value of each column within each block of 5 rows, first use DataFrame.replace if the NAN values are strings (otherwise omit this step). Then use groupby with GroupBy.first, grouping by a helper Series created by comparing the Hospital column with the first value, prelim_arm_1, and applying Series.cumsum:

# only needed if the missing values are stored as the string 'NAN'
df_final = df_final.replace('NAN',np.nan)

df_final = df_final.groupby(df_final['Hospital'].eq('prelim_arm_1').cumsum()).first()
print(df_final)
              Hospital Bug_Hosp code  cont    IPC NO_CT
Hospital                                               
1         prelim_arm_1      133  G45  T256  567TY  5667
2         prelim_arm_1      133  G45  T256  567Tu  3456
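
For reference, here is a minimal self-contained sketch of the same steps wrapped in a function. The function name collapse_to_prelim, the marker parameter, and the trimmed sample frame are hypothetical and only for illustration. It assumes the placeholders are the string 'NAN' and that every group starts with a prelim_arm_1 row. The reset_index(drop=True) call is an optional addition that discards the group counter so the result gets a plain 0..n-1 index.

```python
import numpy as np
import pandas as pd


def collapse_to_prelim(df, marker="prelim_arm_1"):
    """Keep one row per group, filled with the first non-missing value per column.

    Hypothetical helper: assumes each group begins with a `marker` row in the
    'Hospital' column and that missing values are the string 'NAN'.
    """
    # Turn the 'NAN' strings into real missing values so first() can skip them.
    cleaned = df.replace("NAN", np.nan)
    # True on every marker row; the running sum gives 1, 1, ..., 2, 2, ...
    group_id = cleaned["Hospital"].eq(marker).cumsum().rename("group")
    # GroupBy.first() returns the first non-null entry of each column per group.
    return cleaned.groupby(group_id).first().reset_index(drop=True)


arms = ["prelim_arm_1", "24_hour_review_arm_1", "48_hour_review_arm_1",
        "72_hour_review_arm_1", "discharge_informat_arm_1"]

# Trimmed version of the sample data from the question (two groups of five arms).
sample = pd.DataFrame({
    "Hospital": arms * 2,
    "Bug_Hosp": ["133", "NAN", "NAN", "NAN", "NAN"] * 2,
    "IPC": ["NAN", "NAN", "NAN", "567TY", "NAN",
            "NAN", "NAN", "NAN", "567Tu", "NAN"],
    "NO_CT": ["NAN", "NAN", "NAN", "NAN", "5667",
              "NAN", "NAN", "NAN", "3456", "NAN"],
})

result = collapse_to_prelim(sample)
print(result)
```

Because the Hospital value of the first row in each group is prelim_arm_1, that label is what survives in the Hospital column of the collapsed rows.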

Detail (the helper Series, computed on the original df_final before the groupby result overwrites it):

print(df_final['Hospital'].eq('prelim_arm_1').cumsum())
0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    2
9    2
Name: Hospital, dtype: int32
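
As a small sketch of why the helper Series works: the comparison yields True only on prelim_arm_1 rows, so the cumulative sum increases by one at the start of every group. If the marker row could not be relied on, a purely positional counter would be one possible alternative, but that is an assumption not made in the answer above and it only holds when every block has exactly the same number of rows (5 here). The names marker_ids and block_ids are illustrative.

```python
import numpy as np
import pandas as pd

arms = ["prelim_arm_1", "24_hour_review_arm_1", "48_hour_review_arm_1",
        "72_hour_review_arm_1", "discharge_informat_arm_1"]
hospital = pd.Series(arms * 2, name="Hospital")

# Marker-based ids: a new group starts wherever 'prelim_arm_1' appears.
marker_ids = hospital.eq("prelim_arm_1").cumsum()

# Position-based ids (assumption: every block has exactly 5 rows).
block_ids = pd.Series(np.arange(len(hospital)) // 5 + 1, index=hospital.index)

print(marker_ids.tolist())
print(block_ids.tolist())
```

On this sample both approaches label the rows identically; the marker-based version keeps working if some groups have missing or extra arm rows, as long as each group still begins with prelim_arm_1.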
