Feature Engineering - Missing Value Imputation Using Random Forest

1. Introduction

Feature engineering is a crucial step in traditional machine learning, and the gains we can get from tuning a machine learning algorithm are usually limited. If the results after tuning are still unsatisfactory, it is worth going back and doing more feature engineering.

2. Missing value imputation

Handling missing values is a common problem in feature engineering. The usual strategies are as follows:

  1. Delete the data that contains missing values
  2. Fill missing values with the mean of that feature
  3. Fill missing values with the median of that feature
  4. Fill missing values with the mode of that feature
  5. Fill missing values using a machine learning model

Each of these methods has its own advantages, and we can choose a strategy according to our needs. When the dataset is relatively large, the last method generally performs best overall. Today we will look at using a random forest to fill in missing values.
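For reference, here is a minimal sketch of strategies 2 to 4 on a toy column (the DataFrame and the age column are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({"age": [20, 30, None, 31, 20]})

# Strategy 2: fill with the mean of the column
df["age_mean"] = df["age"].fillna(df["age"].mean())
# Strategy 3: fill with the median of the column
df["age_median"] = df["age"].fillna(df["age"].median())
# Strategy 4: fill with the mode (mode() may return several values, take the first)
df["age_mode"] = df["age"].fillna(df["age"].mode()[0])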

3. Data preprocessing

3.1 Processing ideas

Before we start filling in the data, we need to perform some simple preprocessing on the original data. Suppose we want to fill the following data:

name   gender  age  target
zack   male    20   1
rudy   male    30   1
alice  female  20   0
atom   male    31   0
alex   female  32   1
kerry  female       0
king           20   1
nyx    male    20   1
petty  female       0

When using scikit-learn to build a random forest, the training features are not allowed to be strings, so we need to process string-valued columns such as name and gender (and city, in the example below). Here we adopt the one-hot encoding strategy.

Note: the data above is something I fabricated; as for what target means, I don't know either.

First, the name feature usually has no effect on the final result, so we simply delete it. That leaves the gender and city features, which are both categorical. For gender we can use 0 for male and 1 for female. City is a multi-category feature, and we could take the same approach, letting 0 represent city_01, 1 represent city_02, and 2 represent city_03. However, this imposes an artificial ordering (and hence unequal weights) on the city values, and when there are many categories it can seriously distort the results.

Instead, we can change strategy: split the original city feature into three features, city=city_01, city=city_02, and city=city_03, each taking only the value 0 or 1. This solves the problem above.

For example, our original data is as follows:

name   gender  age  city     target
zack   male    21   city_01  1
alice  female  22   city_02  0

The transformed data is as follows (ignoring the name feature):

name   gender=male  gender=female  age  city=city_01  city=city_02  city=city_03  target
zack   1            0              21   1             0             0             1
alice  0            1              22   0             1             0             0

Note that a city=city_03 feature also appears above: the split has to account for every city value that occurs anywhere in the dataset, not just in these two rows.

Another point to note is that the gender feature does not strictly need to be split this way (a single 0/1 column would suffice); for convenience, we apply the same strategy to it rather than treating it separately.
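Incidentally, the same kind of split can be sketched directly in pandas with pd.get_dummies (a minimal illustration with made-up values, not the approach used in the code below):

import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [21, 22],
    "city": ["city_01", "city_02"]
})
# Split gender and city into 0/1 indicator columns named like "gender=male"
df = pd.get_dummies(df, columns=["gender", "city"], prefix_sep="=")
print(df.columns.tolist())
# ['age', 'gender=female', 'gender=male', 'city=city_01', 'city=city_02']

Note that get_dummies only creates columns for values that actually appear in the data, so city=city_03 would only show up here if city_03 occurred somewhere in the column.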

3.2 Code implementation

Following the ideas above, we know how to handle multi-category features. Numeric features need no extra processing, so we iterate over the feature columns and check whether each one needs to be transformed. The specific code is as follows:

import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Create the DictVectorizer
dv = DictVectorizer(sparse=False)
# Read the data
df = pd.read_csv("test.csv")
# Drop the name column
df = df.drop(['name'], axis=1)
# Slice out the feature columns
X = df.iloc[:, 0:-1]

# Iterate over the feature columns
for colum in X.items():  # on pandas < 2.0 this was X.iteritems()
    # Only process non-numeric columns (multi-category data)
    if colum[1].dtype == np.object_:
        # Split into column name and data
        feature_name, data = colum

        # ①: convert the column into a sequence of dicts
        colum = data.map(lambda x: {feature_name: x})
        colum = dv.fit_transform(colum)
        # Feature names after the split, e.g. gender -> [gender=female, gender=male]
        features = dv.get_feature_names_out()
        # Add the newly created columns
        X[features] = colum
        # Drop the original column
        X = X.drop([feature_name], axis=1)
        # ②: if some original values were null, DictVectorizer emits an
        # extra column named after the feature itself; drop it from the list
        if feature_name in features:
            features = list(features)
            features.remove(feature_name)
            features = np.array(features)

        # Rows whose original value was null should be null in every split
        # feature too, e.g. gender null -> gender=male and gender=female null
        mask = X[features].sum(axis=1) == 0
        X.loc[mask, features] = np.nan

Most of the code should be easy to follow; the two parts marked ① and ② deserve some explanation.

3.3 Code analysis

(1) Part ①

At ①, we convert the data of the current column into a sequence of dicts and then call the fit_transform method of the DictVectorizer object. To see what DictVectorizer does, take a look at the following code:

from sklearn.feature_extraction import DictVectorizer

# List of dicts to transform
data = [
    {"gender": "male"},
    {"gender": "female"},
    {"gender": "unknow"},
    {"gender": "male"},
    {"gender": "male"}
]
dv = DictVectorizer(sparse=False)
# Transform the data
data = dv.fit_transform(data)
print(dv.get_feature_names_out())
print(data)

The output of the above code is as follows:

['gender=female' 'gender=male' 'gender=unknow']
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]]

As you can see, this matches the transformation described above. Because dv expects a sequence of dicts, we first need the following code:

colum = data.map(lambda x: {feature_name: x})

This converts the current column into a sequence of dicts; calling dv.fit_transform then performs the transformation.
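To make the mapping step concrete, here is a toy illustration of what map produces (the Series values are made up):

import pandas as pd

s = pd.Series(["male", "female"])
print(s.map(lambda x: {"gender": x}).tolist())
# [{'gender': 'male'}, {'gender': 'female'}]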

(2) Part ②

This part of the code ensures that when gender is nan, the split features gender=female and gender=male end up as nan after the transformation as well. For data with missing values, the following happens during the transformation:

from sklearn.feature_extraction import DictVectorizer

data = [
    {"gender": "male"},
    {"gender": "female"},
    {"gender": "unknow"},
    {"gender": "male"},
    {"gender": None}
]
dv = DictVectorizer(sparse=False)
data = dv.fit_transform(data)
print(dv.get_feature_names_out())
print(data)

Here we added one record with a missing value; the output is as follows:

['gender' 'gender=female' 'gender=male' 'gender=unknow']
[[ 0.  0.  1.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [nan  0.  0.  0.]]

Notice that we expected only three columns, but four appeared: DictVectorizer emits an extra gender column that carries the nan. So we need to remove this extra column from dv.get_feature_names_out().

At this point, our data is fully preprocessed. Now we can use a random forest to fill in the missing values.

4. Using random forest to fill in missing values

4.1 Implementation ideas

Filling in missing values is a process of repeatedly building models and predicting with them. Let's look at another simple set of data:

height  weight  age
181     70      20
178             18
160     50
170     60      19

The data above has two features with missing values that we need to fill. To fill weight, we can select the rows where weight is not empty, take the remaining columns as features and weight as the target, and train a model that predicts weight.

But there is a problem with this: among the rows where weight is not empty, the other features may still contain missing values. In that case we can first fill those remaining missing values with some simple method, and only then train the model that fills in weight.

After the missing values of weight are filled, we use the same method to fill the remaining features that have missing values.

To get good results, we start with the column that has the fewest missing values, since that gives us the most complete training data for filling that column, and then proceed column by column in ascending order of missing-value count. A minimal sketch of a single round is shown below.
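Here is a minimal sketch of one round of this process on the toy table above (the numbers come from the table; everything else is illustrative):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "height": [181, 178, 160, 170],
    "weight": [70, np.nan, 50, 60],
    "age":    [20, 18, np.nan, 19],
})

# This round fills weight: the remaining columns act as features,
# with their own holes temporarily filled by the mode
target = df["weight"]
others = SimpleImputer(strategy="most_frequent").fit_transform(df.drop(columns=["weight"]))

# Train on the rows where weight is known, predict where it is missing
known = target.notnull().values
rf = RandomForestRegressor(n_estimators=100)
rf.fit(others[known], target[known])
df.loc[~known, "weight"] = rf.predict(others[~known])
print(df)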

4.2 Code implementation

This part runs after the multi-category processing implemented above. The code is as follows:

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

y = df.iloc[:, [-1]]

# Sort columns in ascending order of missing-value count
sortindex = np.argsort(X.isnull().sum(axis=0)).values  # axis=0 sums per column
for i in sortindex:
    # Use the current column as the target
    feature_i = X.iloc[:, i]

    # Use the remaining columns (plus the original target y) as features
    tmp_df = pd.concat([X.iloc[:, X.columns != X.columns[i]], y], axis=1)
    # Temporarily fill the remaining columns' missing values with the mode
    imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    tmp_df_mf = imp_mf.fit_transform(tmp_df)

    # Rows where feature_i is not null become the training data
    y_notnull = feature_i[feature_i.notnull()]
    y_null = feature_i[feature_i.isnull()]
    X_notnull = tmp_df_mf[y_notnull.index, :]
    X_null = tmp_df_mf[y_null.index, :]

    # If the column has no missing values, move on to the next one
    if y_null.shape[0] == 0:
        continue

    # Train a random forest regressor
    rfc = RandomForestRegressor(n_estimators=100)
    rfc = rfc.fit(X_notnull, y_notnull)

    # Predict the missing values
    y_predict = rfc.predict(X_null)

    # Fill in the missing values
    X.loc[X.iloc[:, i].isnull(), X.columns[i]] = y_predict

This implements filling missing values with a random forest. The complete code is as follows:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer

dv = DictVectorizer(sparse=False)
df = pd.read_csv("test.csv")
name = df['name']
df = df.drop(['name'], axis=1)
X = df.iloc[:, 0:-1]

# Iterate over the feature columns
for colum in X.items():  # on pandas < 2.0 this was X.iteritems()
    # Only process non-numeric columns
    if colum[1].dtype == np.object_:
        # Split into column name and data
        feature_name, data = colum

        # Convert the column into a sequence of dicts
        colum = data.map(lambda x: {feature_name: x})
        colum = dv.fit_transform(colum)
        features = dv.get_feature_names_out()

        # Add the newly created columns
        X[features] = colum

        # Drop the original column
        X = X.drop([feature_name], axis=1)

        # If some original values were null, drop the extra column
        # DictVectorizer creates for the nan entries
        if feature_name in features:
            features = list(features)
            features.remove(feature_name)
            features = np.array(features)

        # Rows whose original value was null get nan in all split features
        mask = X[features].sum(axis=1) == 0
        X.loc[mask, features] = np.nan

y = df.iloc[:, [-1]]

# Sort columns in ascending order of missing-value count
sortindex = np.argsort(X.isnull().sum(axis=0)).values
for i in sortindex:
    # Use the current column as the target
    feature_i = X.iloc[:, i]

    # Use the remaining columns (plus the original target y) as features
    tmp_df = pd.concat([X.iloc[:, X.columns != X.columns[i]], y], axis=1)
    # Temporarily fill the remaining columns' missing values with the mode
    imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    tmp_df_mf = imp_mf.fit_transform(tmp_df)

    # Rows where feature_i is not null become the training data
    y_notnull = feature_i[feature_i.notnull()]
    y_null = feature_i[feature_i.isnull()]
    X_notnull = tmp_df_mf[y_notnull.index, :]
    X_null = tmp_df_mf[y_null.index, :]

    # If the column has no missing values, move on to the next one
    if y_null.shape[0] == 0:
        continue

    # Train a random forest regressor
    rfc = RandomForestRegressor(n_estimators=100)
    rfc = rfc.fit(X_notnull, y_notnull)

    # Predict the missing values
    y_predict = rfc.predict(X_null)

    # Fill in the missing values
    X.loc[X.iloc[:, i].isnull(), X.columns[i]] = y_predict
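After running the script, a quick sanity check is to confirm that no missing values remain in X:

# Every per-column count should now be 0
print(X.isnull().sum())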

That's all for today's content. For more, you can follow "New Folder X".
