[Python] Data mining, analysis, and cleaning: a summary of missing value processing methods


Link to this article: https://blog.csdn.net/weixin_47058355/article/details/128866686

Foreword

I have not seen many complete summaries of data cleaning methods online, and the methods I have picked up over the past few years are a bit scattered, so I am writing this summary for my own reference and to help others. If you spot a mistake, please point it out in the comments or by private message so we can discuss and learn from each other.

1. View the proportion of missing values

There are two common ways to check for missing values. The first is to calculate the proportion of missing values in each column:

# queshi_bili is the result variable; data_train is the training set
queshi_bili = (data_train.isnull().sum() / data_train.shape[0]).sort_values(ascending=False).map(lambda x: "{:.2%}".format(x))
queshi_bili
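
To make the calculation concrete, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its columns are hypothetical stand-ins for data_train; `isnull().mean()` is used as an equivalent shortcut for dividing the missing count by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data_train
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Fraction of missing rows per column, largest first;
# isnull().mean() equals isnull().sum() / number of rows
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Format as percentages for display only; keep missing_ratio numeric for filtering
missing_report = missing_ratio.map("{:.2%}".format)
print(missing_report)
```

Keeping the numeric ratios separate from the formatted strings makes it easier to filter columns by threshold later.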


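The second is to use the describe() function. Its count row gives the number of non-missing values in each numeric column, which can be compared against the total number of rows: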

data_train.describe()
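
As a rough sketch of how the describe() output can be turned into missing counts, the example below subtracts the count row from the number of rows. The `toy` DataFrame is a hypothetical stand-in for data_train, and by default describe() only covers numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data_train
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

summary = toy.describe()

# The "count" row holds the number of non-missing values per column,
# so the number of missing values is the row count minus that
missing_counts = len(toy) - summary.loc["count"]
print(missing_counts)
```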


2. Statistics-based missing value processing methods

I generally divide missing value processing methods into two types: filling based on statistics, and filling based on machine learning.
The examples that follow use the 其他流动资产 (other current assets) feature in data_train.

2.1 Deletion

Features with a large proportion of missing values are usually better deleted. Filling only estimates missing values from the data that is present, so it always carries some error, and filling too many values mostly adds error.

del data['column_name']  # replace 'column_name' with the column to delete


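Columns can also be deleted by proportion. With the 80% threshold below, a column is kept only if at least 80% of its values are non-missing; in other words, columns with more than roughly 20% missing values are dropped.

t = int(0.8 * data_train.shape[0])  # minimum number of non-missing values a column needs to be kept
data_train_shanchu = data_train.dropna(thresh=t, axis=1)  # keep columns with at least t non-missing values
data_train_shanchu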
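
The effect of thresh is easy to misread, so here is a small sketch on made-up data. The DataFrame `toy` and its column names are hypothetical; the point is only to show which columns survive an 80% threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 rows, columns with different amounts of missing data
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, 4.0, np.nan],     # 4 of 5 values present
    "half_empty": [1.0, np.nan, np.nan, 4.0, 5.0],   # 3 of 5 values present
    "full": [1, 2, 3, 4, 5],                         # 5 of 5 values present
})

# A column must have at least this many non-missing values to be kept
min_non_null = int(0.8 * toy.shape[0])

# axis=1 drops columns (not rows) that fall below the threshold
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

Note that thresh counts non-missing values, so a higher threshold is stricter and removes more columns.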


2.2 Filling with fixed values

Fill missing values with a given constant:

data.fillna(0, inplace=True)  # fill with 0; the first argument is the constant to fill with


A dictionary can be used to give each column its own fixed value. Columns that are not listed in the dictionary keep their missing values.

# Keys are column labels and values are the fill constants; without inplace=True this returns a new DataFrame
data.fillna({0: 1000, 1: 100, 2: 0, 4: 5})
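
A minimal sketch of dictionary filling on made-up data, assuming string column names rather than the integer labels used above. The names `toy`, `x`, `y`, and `z` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in every column
toy = pd.DataFrame({
    "x": [np.nan, 1.0],
    "y": [np.nan, 2.0],
    "z": [np.nan, 3.0],
})

# Only "x" and "y" are listed, so "z" keeps its missing value
filled = toy.fillna({"x": 0, "y": -1})
print(filled)
```

Because inplace is not set, `toy` itself is left unchanged and the result has to be assigned to a new name.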


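2.3 Filling with the mean, median, or mode

The code for these three statistics is nearly identical; only the function changes. Use whichever one suits the data:

# Pick one of the following; once a column has been filled, the later calls have nothing left to fill
data.fillna(data.mean(numeric_only=True), inplace=True)    # fill with the column mean
data.fillna(data.median(numeric_only=True), inplace=True)  # fill with the column median
data.fillna(data.mode().iloc[0], inplace=True)  # fill with the column mode; mode() returns a DataFrame, so take its first row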
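
The three fills can be compared side by side on a small made-up column. This is a sketch only; `toy` and `v` are hypothetical names. The mode is taken with `.iloc[0]` because mode() returns a DataFrame (there can be ties), and passing that DataFrame directly to fillna would align on the row index instead of filling every missing row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: known values 1, 1, 4, 10 and one missing value
toy = pd.DataFrame({"v": [1.0, 1.0, 4.0, np.nan, 10.0]})

mean_filled = toy.fillna(toy.mean())            # mean of the known values
median_filled = toy.fillna(toy.median())        # median of the known values
mode_filled = toy.fillna(toy.mode().iloc[0])    # first (most frequent) mode per column

print(mean_filled.loc[3, "v"], median_filled.loc[3, "v"], mode_filled.loc[3, "v"])
```

The mean is pulled up by the large value 10, while the median and mode are not, which is one reason to prefer them for skewed features.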


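2.4 Filling by interpolation, or with the previous or next value

The default interpolation is linear: a single missing value is replaced by the mean of the values directly above and below it, and a run of consecutive missing values is filled with evenly spaced values between the nearest known neighbors. The disadvantage is that a missing value without the neighboring values it needs, such as one at the very start of a column, stays missing.

data = data.interpolate()  # linear interpolation; a single gap gets the mean of the values above and below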
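
A short sketch of what default linear interpolation does to a made-up Series, covering a single gap, a two-value gap, and a leading gap. The Series `s` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: leading gap, single gap, then a run of two missing values
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

result = s.interpolate()  # default method is linear, treating rows as equally spaced

print(result.tolist())
```

The single gap between 1 and 3 becomes their mean, the run between 3 and 9 is filled in equal steps, and the leading value stays missing because nothing precedes it.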


Forward and backward filling copy the previous or the next row's value into the missing position. The disadvantage is similar: if there is no previous value (forward fill) or no next value (backward fill), the missing value remains.

data.ffill(inplace=True)  # fill with the previous row's value (replaces fillna(method='pad'), which newer pandas deprecates); the previous row may itself be missing
data.bfill(inplace=True)  # fill with the next row's value (replaces fillna(method='bfill')); the next row may itself be missing
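
The edge cases for both directions can be seen on a small made-up Series. This is a sketch; `s` is hypothetical, and chaining the two fills at the end is an optional extra, not something the article itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with missing values at the start, middle, and end
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()    # copies the previous value down; the first row has nothing to copy
backward = s.bfill()   # copies the next value up; the last row has nothing to copy

# Optional: forward fill first, then backward fill whatever is left at the start
both = s.ffill().bfill()

print(forward.tolist(), backward.tolist(), both.tolist())
```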


3. Missing value filling based on machine learning

Machine learning methods for filling missing values are often more accurate than statistical methods, but they need far more computing power and time.
Only the code is shown here; the principles behind each algorithm are easy to look up.

3.1 Filling based on the KNN algorithm

import pandas as pd
from fancyimpute import KNN
# k=6: each missing value is estimated from the 6 nearest rows by Euclidean distance; adjust as needed
data_train_knn = pd.DataFrame(
    KNN(k=6).fit_transform(data_train_shanchu),
    columns=data_train_shanchu.columns,
)
data_train_knn
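
For readers without fancyimpute installed, a similar idea can be sketched with scikit-learn's KNNImputer. This is a swapped-in alternative, not the library used above, and the `toy` data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: f2 is missing in the middle row
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# n_neighbors plays the role of k: the missing value is the average of f2
# over the 2 rows closest to the incomplete row on the features it does have
imputer = KNNImputer(n_neighbors=2)
toy_knn = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(toy_knn)
```

Because distances depend on feature scale, it is usually worth standardizing features before KNN imputation on real data.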


3.2 Filling based on random forest

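Random forest can be used to fill missing values; other models such as LightGBM and XGBoost work too. Random forest is used as the example here. The idea is to train a model on the rows where the target feature is present, using other features as inputs, and then predict the target for the rows where it is missing.

from sklearn.ensemble import RandomForestRegressor

# Fill missing values with a random forest.
# Target: 其他流动资产 (other current assets); predictors: 货币资金 (cash), 资产总计 (total assets)
# data_train is the training set used in the earlier sections
train_data = data_train[['其他流动资产', '货币资金', '资产总计']]
df_notnull = train_data.loc[train_data['其他流动资产'].notnull()]
df_isnull = train_data.loc[train_data['其他流动资产'].isnull()]
X = df_notnull.values[:, 1:]  # predictor columns (assumed to have no missing values)
Y = df_notnull.values[:, 0]   # target column

# Train a RandomForestRegressor on the rows where the target is known
RFR = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
RFR.fit(X, Y)
predict = RFR.predict(df_isnull.values[:, 1:])  # predicted values for the rows where the target is missing
predict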
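
The code above stops at the predictions. A self-contained sketch of the full round trip, including writing the predictions back into the missing positions, might look like the following. The synthetic data, the column names `x1`, `x2`, and `target`, and the smaller n_estimators are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data: target depends on x1 and x2 plus a little noise
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["target"] = 2 * toy["x1"] - toy["x2"] + rng.normal(scale=0.1, size=n)

# Knock out 20% of the target values to simulate missing data
missing_idx = toy.sample(frac=0.2, random_state=0).index
toy.loc[missing_idx, "target"] = np.nan

features = ["x1", "x2"]
known = toy[toy["target"].notnull()]
unknown = toy[toy["target"].isnull()]

# Train on rows where the target is known
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[features], known["target"])

# Write the predictions back into the missing positions of a copy
filled = toy.copy()
filled.loc[unknown.index, "target"] = model.predict(unknown[features])
print(filled["target"].isnull().sum())
```

Assigning by `unknown.index` keeps each prediction aligned with its original row, and the known values are left untouched.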


Summary

These are the missing value processing methods I currently use most often. I will cover other data cleaning methods in future updates. I will not put this behind a paid column; I hope it helps!
