Summary of missing value handling methods
Link to this article: https://blog.csdn.net/weixin_47058355/article/details/128866686
Foreword
I have not found many complete summaries of data cleaning methods online, and the methods I have picked up over the past few years were scattered, so I wrote this summary for my own reference and in the hope that it helps others. If you spot a mistake, please point it out in the comments or by private message so we can discuss and learn from each other.
1. View the proportion of missing values
There are two common ways to check for missing values. The first is to calculate the proportion of missing values in each column:
queshi_bili=((data_train.isnull().sum())/data_train.shape[0]).sort_values(ascending=False).map(lambda x:"{:.2%}".format(x)) # queshi_bili holds the missing-value ratios; data_train is the training set
queshi_bili
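As a small self-contained illustration of the calculation above, the sketch below builds a toy DataFrame (the name `toy` and its values are invented for this example, not taken from the article's data) and computes the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# isnull().mean() gives the fraction of missing entries in each column,
# which is equivalent to isnull().sum() / number_of_rows.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Format the fractions as percentage strings for display.
missing_pct = missing_ratio.map("{:.2%}".format)
print(missing_pct)
```

Keeping the numeric `missing_ratio` next to the formatted strings is convenient if you later want to filter columns by a threshold.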
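The second is to use the describe() function. Its count row gives the number of non-null values in each numeric column, so comparing it with the total number of rows shows how many values are missing: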
data_train.describe()
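One way to turn the describe() output into missing-value counts is sketched below on made-up data. It assumes all columns of interest are numeric, since describe() only summarizes numeric columns by default.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, 3.0],
})

# The "count" row of describe() is the number of non-null values per column.
non_null_counts = toy.describe().loc["count"]

# Subtracting from the total row count gives the number of missing values.
missing_counts = len(toy) - non_null_counts
print(missing_counts)
```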
2. Statistics-based missing value processing method
I generally divide missing value handling into two types: filling based on statistics and filling based on machine learning.
The examples that follow use the feature 其他流动资产 (other current assets) in data_train.
2.1 Deletion
Features with a large proportion of missing values are usually better deleted. Filled values are only estimates predicted or calculated from the existing data, so they carry some error, and filling too much of a column mostly adds error.
del data['column_name'] # replace 'column_name' with the column to delete
You can also delete by proportion. Here a column is kept only if at least 80% of its values are non-null, that is, columns with more than 20% missing values are dropped:
t = int(0.8*data_train.shape[0]) # minimum number of non-null values a column needs to be kept
data_train_shanchu = data_train.dropna(thresh=t,axis=1) # keep columns with at least t non-null values
data_train_shanchu
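To make the effect of `thresh` concrete, here is a minimal sketch on invented data (the column names are hypothetical). With 5 rows and a ratio of 0.8, a column needs at least 4 non-null values to survive.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, 4.0, np.nan],   # 4 of 5 values present
    "half_empty": [1.0, np.nan, np.nan, 4.0, 5.0],  # 3 of 5 values present
    "full": [1, 2, 3, 4, 5],                        # 5 of 5 values present
})

# Minimum number of non-null values required to keep a column.
t = int(0.8 * toy.shape[0])

# axis=1 drops columns (not rows) that have fewer than t non-null values.
kept = toy.dropna(thresh=t, axis=1)
print(kept.columns.tolist())
```

Note that `thresh` counts non-null values, so a higher ratio means a stricter filter.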
2.2 Filling with fixed values
Fill missing values with the given constant
data.fillna(0, inplace=True) # fill with 0; the first argument is the fill constant
You can also pass a dictionary that maps each column to its own fill value. Columns that are not listed in the dictionary keep their missing values.
data.fillna({0: 1000, 1: 100, 2: 0, 4: 5}) # keys are column labels; without inplace=True this returns a new DataFrame
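The dictionary form can be sketched on a toy DataFrame as follows. The column names and fill values are invented; the point is that the unlisted column `z` is left untouched.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "x": [1.0, np.nan],
    "y": [np.nan, 2.0],
    "z": [np.nan, 3.0],
})

# Each key is a column label, each value is the constant used for that column.
# "z" is not listed, so its missing value is not filled.
filled = toy.fillna({"x": 0, "y": -1})
print(filled)
```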
2.3 Filling median, mean, mode
The code for these three statistics is similar; only the function changes. Pick one of the following:
data.fillna(data.mean(numeric_only=True), inplace=True) # fill with the column mean
data.fillna(data.median(numeric_only=True), inplace=True) # fill with the column median
data.fillna(data.mode().iloc[0], inplace=True) # fill with the mode; mode() returns a DataFrame, so take its first row
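A per-column version of the three fills is sketched below on invented data. It assumes the mean or median is used for a numeric column and the mode for a categorical one, which is a common convention rather than something the article prescribes.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "income": [10.0, 20.0, np.nan, 90.0],
    "city": ["A", "B", "A", None],
})

# Mean and median are computed over the non-missing values only.
by_mean = toy["income"].fillna(toy["income"].mean())
by_median = toy["income"].fillna(toy["income"].median())

# mode() returns a Series (there may be ties), so take the first entry.
by_mode = toy["city"].fillna(toy["city"].mode().iloc[0])

print(by_mean.tolist(), by_median.tolist(), by_mode.tolist())
```

The median is less sensitive to outliers such as the 90.0 above, which is why the two numeric fills differ.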
2.4 Filling by interpolation, before or after value filling
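With its default settings, interpolate() performs linear interpolation: a single missing value is replaced by the average of the values directly above and below it (their sum divided by 2), and a longer gap is filled with evenly spaced values. The drawback is that missing values at the start of a column, which have no valid value before them, stay missing; missing values at the end are filled with the last valid value by default.
data = data.interpolate() # linear interpolation between the neighboring values above and below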
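The behavior of the default linear interpolation, including the edge cases, can be checked on a small invented Series:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan, np.nan, 9.0, np.nan])

# Default: method="linear", filling in the forward direction.
out = s.interpolate()
print(out.tolist())

# Position 0 has no earlier value, so it stays NaN.
# Position 2 sits between 1.0 and 3.0 and becomes their average.
# Positions 4 and 5 are spaced evenly between 3.0 and 9.0.
# Position 7 has no later value and is filled with the last valid value.
```

If the leading gap must also be filled, `limit_direction="both"` is one option to consider.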
Forward or backward filling replaces a missing value with the previous or the next valid value. It has a similar drawback to interpolation: if there is no value before the gap (forward fill) or after it (backward fill), the missing value remains.
data.ffill(inplace=True) # fill with the previous value (replaces the deprecated fillna(method='pad')); leading gaps stay missing
data.bfill(inplace=True) # fill with the next value (replaces the deprecated fillna(method='bfill')); trailing gaps stay missing
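The sketch below shows both directions on an invented Series, and one possible way to cover the leftover gaps by chaining the two fills. Chaining is a suggestion for illustration, not part of the original method.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # the leading NaN remains
backward = s.bfill()  # the trailing NaN remains

# Forward fill first, then backward fill whatever is still missing.
both = s.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```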
3. Missing value filling based on machine learning
Machine learning methods often fill missing values more accurately than statistical methods, but they need far more computing power and time.
This section only demonstrates the code; look up the underlying algorithms separately if you need the theory.
3.1 Filling based on the KNN algorithm
import pandas as pd
from fancyimpute import KNN
# k=6: each missing value is estimated from the 6 nearest rows by Euclidean distance; adjust k as needed
data_train_knn = pd.DataFrame(KNN(k=6).fit_transform(data_train_shanchu),
                              columns=data_train_shanchu.columns)
data_train_knn
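If fancyimpute is not available, a similar result can be obtained with scikit-learn's `KNNImputer`. This is a swapped-in alternative, not the library used above, and its weighting may differ from fancyimpute's. The data and column names below are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data invented for illustration only.
toy = pd.DataFrame(
    [[1.0, 2.0, np.nan],
     [3.0, 4.0, 3.0],
     [np.nan, 6.0, 5.0],
     [8.0, 8.0, 7.0]],
    columns=["f1", "f2", "f3"],
)

# Each missing value is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows share.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

Because the method is distance based, scaling the features beforehand is usually worth considering when they have very different ranges.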
3.2 Filling based on random forest
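Random forest can also be used to fill missing values; other models such as LightGBM and XGBoost work too. Here we take random forest as an example. The idea is to train on the rows where the target feature is present, using the other features as inputs, and then predict the target for the rows where it is missing.
from sklearn.ensemble import RandomForestRegressor
# Fill missing values with a random forest
# Columns: 其他流动资产 (other current assets, the target), 货币资金 (monetary funds), 资产总计 (total assets)
train_data = data_train[['其他流动资产', '货币资金', '资产总计']].copy()
df_notnull = train_data.loc[(train_data['其他流动资产'].notnull())]
df_isnull = train_data.loc[(train_data['其他流动资产'].isnull())]
X = df_notnull.values[:,1:] # features: every column except the first
Y = df_notnull.values[:,0] # target: the first column
# Train a RandomForestRegressor on the rows where the target is known
RFR = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
RFR.fit(X,Y)
predict = RFR.predict(df_isnull.values[:,1:]) # predicted values for the rows with a missing target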
predict
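The full round trip, including writing the predictions back into the DataFrame, might look like the following sketch. The data is synthetic and the column names `x1`, `x2`, and `target` are hypothetical; it also assumes the feature columns themselves contain no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data invented for illustration only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["target"] = 2 * df["x1"] + df["x2"]

# Knock out 20 target values to simulate missing data.
missing_idx = rng.choice(n, size=20, replace=False)
df.loc[missing_idx, "target"] = np.nan

# Split rows by whether the target is known.
known = df[df["target"].notnull()]
unknown = df[df["target"].isnull()]

# Train on the known rows, then predict the missing targets.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[["x1", "x2"]], known["target"])
pred = model.predict(unknown[["x1", "x2"]])

# Write the predictions back at the positions that were missing.
df_filled = df.copy()
df_filled.loc[unknown.index, "target"] = pred
print(df_filled["target"].isnull().sum())
```

Assigning through `unknown.index` keeps each prediction aligned with its original row, which matters when the missing rows are scattered through the data.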
Summary
These are the missing value handling methods I currently use most often. I will cover other data cleaning methods in future updates. I will not put this behind a paid column, and I hope it is helpful to everyone!