With a pandas dataframe of the form:
A B C
ID
1 10 NaN NaN
2 20 NaN NaN
3 28 10.0 NaN
4 32 18.0 10.0
5 34 22.0 16.0
6 34 24.0 20.0
7 34 26.0 21.0
8 34 26.0 22.0
How can I remove a varying number of initial missing values? Initially, I'd like to forward fill the last values of the "new" columns so I'll end up with this:
A B C
0 10 10.0 10.0
1 20 18.0 16.0
2 28 22.0 20.0
3 32 24.0 21.0
4 34 26.0 22.0
5 34 26.0 22.0
6 34 26.0 22.0
7 34 26.0 22.0
But I guess it would be just as natural to have nans on the remaining rows too:
A B C
0 10 10.0 10.0
1 20 18.0 16.0
2 28 22.0 20.0
3 32 24.0 21.0
4 34 26.0 22.0
5 34 26.0 NaN
6 34 NaN NaN
7 34 NaN NaN
Here's a visual representation of the issue: [images: "Before" and "After"]
I've come up with a cumbersome approach using a for loop where I remove the leading NaNs using df.dropna(), count the number of values I've removed (N), append the last available number N times, and build a new dataframe column by column. But this turned out to be pretty slow for larger dataframes. I feel like this should already be built-in functionality of the omnipotent pandas library, but I haven't found anything so far. Does anyone have a suggestion for a less cumbersome way of doing this?
Complete code with a sample dataset:
import pandas as pd
import numpy as np

# sample dataframe
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': [10, 20, 28, 32, 34, 34, 34, 34],
                   'B': [np.nan, np.nan, 10, 18, 22, 24, 26, 26],
                   'C': [np.nan, np.nan, np.nan, 10, 16, 20, 21, 22]})
df = df.set_index('ID')

# container for the dataframe
# to be built using a for loop
df_new = pd.DataFrame()
for col in df.columns:
    # drop missing values column by column
    # and renumber the remaining rows from 0
    ser = df[col]
    original_length = len(ser)
    ser_new = ser.dropna().reset_index(drop=True)

    # if leading values are removed for N rows,
    # append the last value N times for the last rows
    N = original_length - len(ser_new)
    if N > 0:
        ser_append = [ser.iloc[-1]] * N
        # ser_append = [np.nan] * N
        # Series.append was removed in pandas 2.0, so use pd.concat
        ser_new = pd.concat([ser_new, pd.Series(ser_append)], ignore_index=True)
    df_new[col] = ser_new
df_new
Here is a pure pandas solution. Use apply to shift each column up by its number of NaNs (in the sample data, all of them are leading), then use ffill to fill the rows left empty at the bottom:
df.apply(lambda x: x.shift(-x.isna().sum())).ffill()
A B C
1 10 10.0 10.0
2 20 18.0 16.0
3 28 22.0 20.0
4 32 24.0 21.0
5 34 26.0 22.0
6 34 26.0 22.0
7 34 26.0 22.0
8 34 26.0 22.0
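Note that x.isna().sum() counts every NaN in a column, not only the leading ones, so the one-liner assumes there are no gaps further down. Below is a minimal sketch of a variant that counts only the leading NaNs and makes the forward fill optional, which also gives the second output from the question. The helper name shift_up_leading_nans and its fill_tail parameter are made up for illustration; they are not part of pandas.

```python
import numpy as np
import pandas as pd


def shift_up_leading_nans(df, fill_tail=True):
    """Shift each column up by its number of leading NaNs.

    Hypothetical helper for illustration, not a pandas built-in.
    """
    def _shift(col):
        # Position of the first valid value = number of leading NaNs.
        # NaNs further down the column are not counted.
        valid = col.notna().to_numpy()
        n_leading = int(valid.argmax()) if valid.any() else 0
        return col.shift(-n_leading)

    shifted = df.apply(_shift)
    # ffill repeats the last value in the rows vacated by the shift
    # (it would also fill any interior NaNs, if the data had them).
    return shifted.ffill() if fill_tail else shifted


df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': [10, 20, 28, 32, 34, 34, 34, 34],
                   'B': [np.nan, np.nan, 10, 18, 22, 24, 26, 26],
                   'C': [np.nan, np.nan, np.nan, 10, 16, 20, 21, 22]}).set_index('ID')

filled = shift_up_leading_nans(df)                       # tail forward-filled
with_nans = shift_up_leading_nans(df, fill_tail=False)   # tail left as NaN
```

Both results keep the original ID index (1 to 8); add .reset_index(drop=True) if you want the 0-based index shown in the question.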