I have a dataset that I want to transform; below is a small sample of what it looks like. The Hospital column contains a set of arm rows (prelim_arm_1 followed by the review and discharge arms) that repeats until the end of the DataFrame. I want to transform it so that, for each group, all the data is stored on the first row, prelim_arm_1, and the remaining arm rows are deleted.
import pandas as pd
import numpy as np
# Initialise the data as a dict of lists.
data = {
    'Hospital': ['prelim_arm_1', '24_hour_review_arm_1', '48_hour_review_arm_1',
                 '72_hour_review_arm_1', 'discharge_informat_arm_1',
                 'prelim_arm_1', '24_hour_review_arm_1', '48_hour_review_arm_1',
                 '72_hour_review_arm_1', 'discharge_informat_arm_1'],
    'Bug_Hosp': ['133', 'NAN', 'NAN', 'NAN', 'NAN', '133', 'NAN', 'NAN', 'NAN', 'NAN'],
    'code': ['G45', 'NAN', 'NAN', 'NAN', 'NAN', 'G45', 'NAN', 'NAN', 'NAN', 'NAN'],
    'cont': ['T256', 'NAN', 'NAN', 'NAN', 'NAN', 'T256', 'NAN', 'NAN', 'NAN', 'NAN'],
    'IPC': ['NAN', 'NAN', 'NAN', '567TY', 'NAN', 'NAN', 'NAN', 'NAN', '567Tu', 'NAN'],
    'NO_CT': ['NAN', 'NAN', 'NAN', 'NAN', '5667', 'NAN', 'NAN', 'NAN', '3456', 'NAN'],
}
# Create DataFrame
df_final = pd.DataFrame(data)
# Print the output.
print(df_final)
The final dataset should look like this:
import pandas as pd
import numpy as np
# Initialise the data as a dict of lists.
data = {
    'Hospital': ['prelim_arm_1'],
    'Bug_Hosp': ['133'],
    'code': ['G45'],
    'cont': ['T256'],
    'IPC': ['567TY'],
    'NO_CT': ['5667'],
}
# Create DataFrame
df_final = pd.DataFrame(data)
# Print the output.
print(df_final)
The dataset is huge, with the arm rows repeating throughout. For each group of arm rows, I want to keep the data only on prelim_arm_1 and delete the other arm rows, so the final table has one prelim_arm_1 row with data per group.
If you want the first non-missing value of each column within each block of 5 rows, first use DataFrame.replace to turn the 'NAN' strings into real missing values (omit this step if they are already NaN). Then use groupby with GroupBy.first, grouping by a helper Series created by comparing the Hospital column with the first value, prelim_arm_1, and applying Series.cumsum:
# only needed if the missing values are stored as the string 'NAN'
df_final = df_final.replace('NAN',np.nan)
df_final = df_final.groupby(df_final['Hospital'].eq('prelim_arm_1').cumsum()).first()
print(df_final)
              Hospital Bug_Hosp code  cont    IPC NO_CT
Hospital
1         prelim_arm_1      133  G45  T256  567TY  5667
2         prelim_arm_1      133  G45  T256  567Tu  3456
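The result above keeps the group counter as its index, and that index is also named Hospital. A minimal sketch, assuming a plain 0-based index is preferred, rebuilds a reduced version of the sample and drops the counter with reset_index(drop=True). The names `df`, `group_id`, and `out` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

arms = ['prelim_arm_1', '24_hour_review_arm_1', '48_hour_review_arm_1',
        '72_hour_review_arm_1', 'discharge_informat_arm_1']

# Reduced version of the sample: two groups of five arm rows.
df = pd.DataFrame({
    'Hospital': arms * 2,
    'Bug_Hosp': ['133', 'NAN', 'NAN', 'NAN', 'NAN'] * 2,
    'IPC': ['NAN', 'NAN', 'NAN', '567TY', 'NAN',
            'NAN', 'NAN', 'NAN', '567Tu', 'NAN'],
})

# Turn the 'NAN' strings into real missing values.
df = df.replace('NAN', np.nan)

# Helper Series: the counter increases at every prelim_arm_1 row.
group_id = df['Hospital'].eq('prelim_arm_1').cumsum()

# First non-missing value per column and group, then discard the counter index.
out = df.groupby(group_id).first().reset_index(drop=True)
print(out)
```

drop=True matters here: the frame already has a Hospital column, so a plain reset_index() would try to insert the index as a second column with the same name.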
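Detail (the helper Series, evaluated on the original df_final before it is overwritten by the groupby result):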
print(df_final['Hospital'].eq('prelim_arm_1').cumsum())
0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    2
9    2
Name: Hospital, dtype: int32
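To see why GroupBy.first is used instead of simply keeping the prelim_arm_1 rows, here is a small sketch on a cut-down frame with two arms per group (a simplification made for illustration). It compares first(), which takes the first non-missing value of each column independently, with head(1), which returns the first row of each group unchanged. The names `first_row` and `first_valid` are illustrative.

```python
import numpy as np
import pandas as pd

# Cut-down frame: two arms per group, values spread over different rows.
df = pd.DataFrame({
    'Hospital': ['prelim_arm_1', '24_hour_review_arm_1',
                 'prelim_arm_1', '24_hour_review_arm_1'],
    'code': ['G45', np.nan, 'G45', np.nan],
    'IPC': [np.nan, '567TY', np.nan, '567Tu'],
})

# Each prelim_arm_1 row starts a new group.
group_id = df['Hospital'].eq('prelim_arm_1').cumsum()
print(group_id.tolist())  # [1, 1, 2, 2]

# head(1) keeps the prelim_arm_1 rows as they are, so IPC stays missing.
first_row = df.groupby(group_id).head(1)
print(first_row)

# first() skips missing values per column, pulling IPC up from the later row.
first_valid = df.groupby(group_id).first()
print(first_valid)
```

Because first() works column by column, the values in one output row can come from different source rows of the same group, which is the behaviour the question asks for.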