[Machine Learning] Summary of Processing Methods for Missing Values

1. Overview

(Figures omitted.)

2. Introduction

There is a widely circulated saying in the industry: data and features determine the upper limit of machine learning, while models and algorithms merely approximate that limit.

Without good data, no amount of modeling will improve the results, so data quality is crucial for data analysis. Yet whenever we download or scrape the data we need, we inevitably run into missing values. How should we handle them?

3. Reasons for missing data

Missing attribute values crop up all the time in real-world databases. The reasons are varied, but they can be broadly divided into the following two types:

  • Objective reasons: some information cannot be obtained in time, such as clinical test results that take a while to come back; some attribute values simply do not exist, such as the spouse's name of an unmarried person; the cost of obtaining certain information is too high; or the system demands real-time responses and a decision must be made before the information becomes available.
  • Subjective reasons: data omitted through human error during entry, or information deemed irrelevant and left out.

In short, we need to be clear about why the values are missing: was it staff negligence, or can the data genuinely not be obtained? Only by pinning down the cause can we prescribe the right remedy.

4. Types of missing data

Let us first clarify two concepts: variables without missing values in a data set are called complete variables, and variables with missing values are called incomplete variables. From the perspective of how the missingness is distributed, missing data can be divided into missing completely at random, missing at random, and missing not at random.

  • Missing completely at random (MCAR): the missingness is completely random, depends on neither the incomplete nor the complete variables, and does not bias the sample. Example: a missing home address.
  • Missing at random (MAR): the missingness is not completely random; it depends on other, fully observed variables. Example: whether financial data is missing depends on the size of the enterprise.
  • Missing not at random (MNAR): the missingness depends on the value of the incomplete variable itself. Example: high-income groups are unwilling to report household income.

The type of missingness bears directly on how we should handle it. For the latter two cases, simply deleting the records is inappropriate: under MAR, missing values can be estimated from the known variables, while for MNAR there is still no satisfactory general solution.

5. How to deal with missing data

So what methods are available? Approaches to handling incomplete data sets mainly fall into the following three categories:

  • Deleting records
  • Data filling
  • No processing

5.1 Deleting records

This method is convenient, fast, and admittedly crude. It sacrifices a large amount of data and may throw away important hidden information; when missing records make up a large share of the data, deletion can distort the data distribution and lead to wrong conclusions. It is therefore suitable only when the missing records are a small fraction of the sample. Rows or columns with missing values can be removed directly with dropna in pandas:

# Drop rows that contain any missing value
df.dropna(how='any', axis=0)
df.dropna()  # equivalent form
# Drop columns that contain any missing value
df.dropna(how='any', axis=1)
# Drop only rows in which every value is missing
df.dropna(how='all')

5.2 Data filling

This approach fills each empty cell with some value so that the data set becomes complete. Filling methods generally fall into three types: replacing missing values, fitting missing values, and dummy variables.

5.2.1 Replace missing values

(1) Mean filling. Divide the attributes into numerical and non-numerical ones. If the attribute with the null value is quantitative, fill it with the mean (or median) of that attribute over all other objects; if it is qualitative, fill it with the mode of that attribute over all other objects. The method is simple but not very accurate, and it may distort the original distribution of the feature.

# Fill NAs in var1 with its mean or median
df['var1'].fillna(df['var1'].mean())
df['var1'].fillna(df['var1'].median())
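
For a qualitative attribute, the mode can be used in the same way (a small sketch, where var2 is a hypothetical categorical column):

# Fill NAs in a hypothetical categorical column var2 with its mode
df['var2'].fillna(df['var2'].mode()[0])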

(2) Hot-deck imputation. The idea is simple: find the object most similar to the one with the missing value and borrow its value as the fill. Usually more than one similar object is found; since there is no single best match among them, one is picked at random. Different problems may call for different similarity criteria, but defining similarity is hard and inevitably subjective.
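
As a rough illustration, here is a minimal hot-deck sketch, assuming hypothetical columns age and income and using the absolute difference in age as the similarity criterion:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 31, 50],
                   'income': [3000, np.nan, 3500, 8000]})
donors = df[df['income'].notna()]          # complete rows act as donors
for i in df.index[df['income'].isna()]:
    # similarity: smallest absolute difference in age
    nearest = (donors['age'] - df.at[i, 'age']).abs().idxmin()
    df.at[i, 'income'] = donors.at[nearest, 'income']
print(df)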

(3) K-nearest-neighbor method. First determine, by Euclidean distance or correlation analysis, the K samples closest to the sample with the missing value, then estimate the missing value as a weighted combination of the values of those K neighbors.
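
scikit-learn ships a ready-made KNNImputer that follows this idea (a minimal sketch on a made-up matrix):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
# distance-weighted average over the 2 nearest complete samples
imputer = KNNImputer(n_neighbors=2, weights='distance')
print(imputer.fit_transform(X))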

(4) Filling with all possible values. Here every possible value of the vacant attribute is tried as a fill, which can give a good filling effect; but when the data set is large and many attribute values are missing, the computational cost is high.
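
A toy sketch of the idea for one categorical column (the column names are made up): each record with a missing value is expanded into one candidate row per possible value.

import pandas as pd

df = pd.DataFrame({'color': ['red', None, 'blue'], 'price': [10, 12, 9]})
candidates = df['color'].dropna().unique()   # all observed values
missing = df['color'].isna()
expanded = pd.concat(
    [df[~missing]] +
    [df[missing].assign(color=c) for c in candidates],  # one copy per value
    ignore_index=True,
)
print(expanded)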

5.2.2 Fitting missing values

(1) Regression. A regression equation is built on the complete data. For objects containing null values, the known attribute values are substituted into the equation to estimate the unknown ones, and the estimates are used as fills. The drawback is that it can produce biased estimates when the variables are not linearly related or when the predictors are highly correlated.
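
A hedged sketch of regression imputation (the columns are made up): fit a linear regression on the complete rows, then predict the missing target values.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.1, 4.2, None, 8.1, 9.9]})
known = df[df['y'].notna()]
reg = LinearRegression().fit(known[['x']], known['y'])
# fill the missing y values with the regression predictions
df.loc[df['y'].isna(), 'y'] = reg.predict(df.loc[df['y'].isna(), ['x']])
print(df)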

(2) Expectation maximization. The EM algorithm alternates between two steps. The E step computes the conditional expectation of the complete-data log-likelihood, given the observed data and the parameter estimates from the previous iteration. The M step maximizes that expected log-likelihood to update the parameters for the next iteration. The algorithm iterates between the E and M steps until convergence. Its drawbacks are that it may get stuck in a local optimum, convergence can be slow, and the computation is relatively involved.
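
As a concrete toy example (not a general implementation), here is an EM sketch for a bivariate Gaussian whose missing values are confined to the second column: the E step fills each missing entry with its conditional expectation, and the M step re-estimates the mean and covariance, adding back the conditional variance.

import numpy as np

def em_impute(X, n_iter=100, tol=1e-6):
    X = X.copy()
    miss = np.isnan(X[:, 1])        # missing entries, second column only
    mu = np.nanmean(X, axis=0)      # initialize from observed values
    Sigma = np.cov(X[~miss].T)      # ... and from complete cases
    for _ in range(n_iter):
        # E step: conditional mean/variance of x2 given x1 for missing rows
        beta = Sigma[0, 1] / Sigma[0, 0]
        X[miss, 1] = mu[1] + beta * (X[miss, 0] - mu[0])
        cond_var = Sigma[1, 1] - beta * Sigma[0, 1]
        # M step: re-estimate parameters, restoring the conditional variance
        mu_new = X.mean(axis=0)
        diff = X - mu_new
        Sigma_new = diff.T @ diff / len(X)
        Sigma_new[1, 1] += miss.sum() * cond_var / len(X)
        done = np.abs(mu_new - mu).max() < tol
        mu, Sigma = mu_new, Sigma_new
        if done:
            break
    return X, mu, Sigma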

(3) Multiple imputation. Derived from Bayesian estimation, this method has three main steps. First, a set of plausible fill values is generated for every null value, reflecting the uncertainty of the non-response model; each set is used to fill the data, yielding several completed data sets. Second, each completed data set is analyzed with the statistical method intended for complete data. Third, the results from the completed data sets are pooled into a final statistical inference that accounts for the uncertainty introduced by the imputation. The method treats each missing value as a random draw, so the resulting inferences reflect the uncertainty of the missing values. Its drawback is computational complexity.
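
scikit-learn's IterativeImputer can stand in for a MICE-style multiple imputation (a sketch under that assumption, pooling a made-up statistic across the imputed data sets):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0],
              [np.nan, 8.0], [5.0, 10.0]])
estimates = []
for seed in range(5):                        # 5 imputed data sets
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    X_filled = imp.fit_transform(X)
    estimates.append(X_filled.mean(axis=0))  # per-data-set statistic
print(np.mean(estimates, axis=0))            # pooled estimate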

(4) Random forest. This method comes up often in competitions: the feature with missing values is treated as the target variable and predicted from the other features. A sample snippet:

# Import the random forest regressor
from sklearn.ensemble import RandomForestRegressor

def set_missing_ages(df):
    # Put the numerical features into the random forest
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    known_age = age_df[age_df.Age.notnull()].values    # .values replaces the removed .as_matrix()
    unknown_age = age_df[age_df.Age.isnull()].values
    y = known_age[:, 0]   # y is Age, the first column
    x = known_age[:, 1:]  # x holds the remaining feature columns
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    # Fit the random forest on the rows where Age is known
    rfr.fit(x, y)
    # Predict the missing ages
    predictedAges = rfr.predict(unknown_age[:, 1:])
    # Fill in the missing values
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges

    return df, rfr
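
A hedged usage sketch, assuming a Titanic-style DataFrame named data_train:

data_train, rfr = set_missing_ages(data_train)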

5.2.3 Dummy variables

A dummy variable is a new variable derived from the missingness itself. Concretely, we define a new binary variable that records whether the feature value is missing. For example, if feature A contains missing values, we derive a new feature B: wherever the value in A is missing, the corresponding value in B is 1; wherever it is not missing, B is 0. A sample snippet:

# Copy the column into CabinCat
data_train['CabinCat'] = data_train['Cabin'].copy()
# Set the dummy variable: "No" = not missing, "Yes" = missing
data_train.loc[(data_train.CabinCat.notnull()), 'CabinCat'] = "No"
data_train.loc[(data_train.CabinCat.isnull()), 'CabinCat'] = "Yes"
# Inspect the result
data_train[['Cabin', 'CabinCat']].head(10)

5.3 No processing

The first two categories of methods fill the unknowns with our own subjective estimates, so they inevitably alter the original data set to some extent. Worse, incorrect filling of null values often injects new noise into the data and steers the mining task toward wrong results. In many cases, therefore, we would rather process the information system while keeping the original information intact. Moreover, some models handle missing values on their own and need no preprocessing at all, XGBoost being a prominent example.
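
A minimal sketch of this (assuming the xgboost package and its scikit-learn wrapper are installed): XGBoost accepts NaN directly and learns a default split direction for missing values, so no imputation step is needed.

import numpy as np
from xgboost import XGBRegressor

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0], [4.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = XGBRegressor(n_estimators=50, missing=np.nan)  # NaN treated as missing
model.fit(X, y)     # no imputation needed beforehand
print(model.predict(X))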

6. Demonstration exercise

Let's close with a small case study that walks through several commonly used missing-value methods:

import pandas as pd
# Load the data set, treating 'Na' strings as missing
df = pd.read_csv('Data.csv', encoding='gbk', na_values='Na')
df.dtypes
# Clean the Mileage column: strip '$' and thousands separators
def f(x):
    x = str(x)
    if '$' in x:
        x = x.strip('$')
    x = x.replace(',', '')
    return float(x)
df['Mileage'] = df['Mileage'].apply(f)
# Show the fraction of missing values per variable
df.apply(lambda x: sum(x.isnull()) / len(x), axis=0)
# Direct deletion: drop rows where 'Condition', 'Price' or 'Mileage' is missing
df.dropna(axis=0, how='any', subset=['Condition', 'Price', 'Mileage'])
# Fill Mileage with the mean
df.Mileage.fillna(df.Mileage.mean())
# Fill Exterior_Color with the mode
df.Exterior_Color.fillna(df.Exterior_Color.mode()[0])
# Fill several columns at once: Exterior_Color with its mode, Mileage with its mean
df.fillna(value={'Exterior_Color': df.Exterior_Color.mode()[0],
                 'Mileage': df.Mileage.mean()})
# Dummy-variable method
df['Watch_Count1'] = df['Watch_Count'].copy()
df.loc[(df.Watch_Count.notnull()), 'Watch_Count1'] = "No"
df.loc[(df.Watch_Count.isnull()), 'Watch_Count1'] = "Yes"
df[['Watch_Count', 'Watch_Count1']].head(10)

7. Summary

In general, most data-mining preprocessing reaches for the more convenient methods, such as mean filling, but the results are not necessarily good. The appropriate method should be chosen according to the needs at hand; there is no universally best one, and the final choice has to weigh several considerations.

Origin: blog.csdn.net/wzk4869/article/details/128745956