Machine Learning - Handling Missing Values

Handling missing values

In Python, missing values are generally represented as NaN, short for "not a number".
The following code counts the missing values in each column; here the data is stored in a pandas DataFrame:

print(data.isnull().sum())
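As a concrete illustration, here is a minimal sketch on a small made-up DataFrame (the column names are invented for the example); `isnull().sum()` gives per-column counts, and chaining a second `.sum()` gives the grand total:

```python
import numpy as np
import pandas as pd

# Small hypothetical DataFrame with a few missing entries
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, 8.0, 9.0],
})

# Missing-value count per column
print(df.isnull().sum())

# Total number of missing values in the whole frame
print(df.isnull().sum().sum())  # 3
```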

There are several ways to handle missing values:

1. Remove data columns that contain missing values

data_without_missing_values = original_data.dropna(axis=1)

In most cases, we have to drop the same columns in both the training dataset and the test dataset.

cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

This method is suitable when a column contains too many missing values to be worth keeping.
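One way to make "too many missing values" concrete is to drop only columns above a missing-fraction threshold. A minimal sketch on made-up data (column names and the 50% threshold are choices for the example, not from the original):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column is mostly missing
df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
    'few_missing':    [1.0, np.nan, 3.0, 4.0],
    'complete':       [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() gives the fraction of missing values per column
missing_frac = df.isnull().mean()

# Drop only the columns where more than half of the values are missing
cols_to_drop = missing_frac[missing_frac > 0.5].index
reduced = df.drop(cols_to_drop, axis=1)
print(list(reduced.columns))  # ['few_missing', 'complete']
```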

2. Fill in missing values

This method usually trains a better model than dropping the columns outright.

# Imputer was removed from sklearn.preprocessing; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)

The default imputation strategy is to fill with the column mean.
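In current scikit-learn the class is `sklearn.impute.SimpleImputer`, and other strategies are available besides the mean. A small sketch on made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# strategy='mean' is the default; 'median', 'most_frequent', and
# 'constant' (with fill_value=...) are also supported
mean_imputer = SimpleImputer(strategy='mean')
print(mean_imputer.fit_transform(X))
# NaN in column 0 -> (1 + 7) / 2 = 4.0; NaN in column 1 -> (2 + 4) / 2 = 3.0

median_imputer = SimpleImputer(strategy='median')
print(median_imputer.fit_transform(X))
```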

3. Extended imputation: track which values were missing

If the missingness itself carries important feature information, we need to record which values were missing in the original data by storing that information in boolean columns.

from sklearn.impute import SimpleImputer

# First make a copy of the original data
new_data = original_data.copy()
# Create new columns recording which entries of each affected column were missing
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()
# Imputation
my_imputer = SimpleImputer()
new_data = my_imputer.fit_transform(new_data)
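To see what this produces, here is a minimal sketch on made-up data (the column names are invented for the example); the boolean indicator columns survive imputation as 0.0/1.0 values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

original_data = pd.DataFrame({
    'age':    [25.0, np.nan, 40.0],
    'income': [50.0, 60.0, np.nan],
})

new_data = original_data.copy()
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Two indicator columns have been appended
print(new_data.columns.tolist())
# ['age', 'income', 'age_was_missing', 'income_was_missing']

# The imputer fills NaNs with column means and casts the booleans to 0.0/1.0
imputed = SimpleImputer().fit_transform(new_data)
print(imputed)
```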

Example

Below is a house price prediction example comparing the three approaches to handling missing values described above.

import pandas as pd

# Load the data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)

# For simplicity, train the model on numeric columns only
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors, 
                                                    melb_target,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=0)
# Define a function that scores the model by its MAE
def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()  # use a random forest model
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

To test the first method, drop the columns that contain missing values:

cols_with_missing = [col for col in X_train.columns 
                                 if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

Mean Absolute Error from dropping columns with Missing Values:
347871.8471099837

For the second method, impute with the column mean:

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Mean Absolute Error from Imputation:
201753.99398441747

For the third method, add extra columns that record which values were missing:

# Copy the original data
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
# Get the names of the columns that contain missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Add indicator columns recording the missingness,
# e.g. sequences like True, False, True, False
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Mean Absolute Error from Imputation while Tracking What Was Imputed:
200147.29626743973

Summary

In the example above, the performance difference between methods 2 and 3 is small, but in some cases it can be very significant.
