Full analysis of second-hand car transaction price prediction code (3): feature engineering and missing value processing

The road ahead is long and far; I will search for it high and low.

Missing value processing ideas

Let's first review Section 2. There we covered feature construction: we analyzed the correlations between features, deleted useless ones, and built some new features, for example used_time (duration of use) and brand_and_price_mean (the mean price per brand).

When we constructed those new features, we used the fillna() function to fill in their missing values. But note that many missing values remain in the original feature columns, and they need handling too. The idea here is:

(1) Very few missing values: fill them directly with the median or mean (a tiny sketch of this follows below).
(2) Many missing values: use a machine learning model to predict them.
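
For example, idea (1) on a made-up pandas Series:

import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0])
print(s.fillna(s.median()))   # the NaN becomes 20.0, the median of the two known values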

In addition, regarding missing values, we also need to keep two things in mind:
(1) Both the training set and the test set may contain missing values, so both have to be filled.
(2) We must use existing values to predict the missing ones; after all, NaN cannot be used to predict NaN, right?

In this step, we first delete some columns that are not relevant to the prediction:

data_all=data_all.drop(['SaleID','regDate','creatDate','type'],axis=1)   # drop the ID, the raw date columns, and the train/test flag

SaleID is the transaction ID; regDate and creatDate have already been used to build the new used_time feature; and type is the label we defined at the start of the code to distinguish the train set from the test set. None of these are useful for prediction, so we remove them first.

Next, let's check which columns have missing values and how many are missing:

print(data_all.isnull().sum())   # check how many missing values each column has

The printed output, a per-column count of missing values, shows that the model column has only one missing value, while bodyType, fuelType, gearbox, notRepairedDamage, price, and used_time all have large numbers of missing values, running into the tens of thousands.

Filling very few missing values

So for model, the column with the fewest missing values, we simply fill with the median:

data_all['model']=data_all['model'].fillna(data_all['model'].median())
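
Incidentally, since model is a category code, an alternative that the original code does not use would be to fill with the most frequent value (the mode) instead:

# alternative (not what the original code does): fill a categorical column with its mode
data_all['model'] = data_all['model'].fillna(data_all['model'].mode()[0])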

For bodyType and the other attributes with a particularly large number of missing values, we use a machine learning model to predict and fill in what is missing. This amounts to building a separate model for each column that contains missing values: a bodyType prediction model, a fuelType prediction model, and so on.

Large numbers of missing values: Filling in with machine learning model predictions

First, we prepare the data needed to build the bodyType prediction model (a train set and a test set) and drop the columns that contain large numbers of missing values, because NaN cannot be used to predict NaN, and NaN cannot appear in the training features either. The preprocessing code is as follows:

# process bodyType
X_bT=data_all.loc[(data_all['bodyType'].notnull()),:]
# first select all rows where bodyType is not null (notnull() marks each row True or False, non-empty or empty)
X_bT=X_bT.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)   # drop the columns with many missing values from X_bT
ybT_train=data_all.loc[data_all['bodyType'].notnull()==True,'bodyType']   # target: only the non-null bodyType values
XbT_test=data_all.loc[data_all['bodyType'].isnull()==True,:]
XbT_test=XbT_test.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)

In the above code:

notnull()==True selects the rows that are not empty, and isnull()==True selects the rows that are empty. As mentioned in Section 2, this works because notnull() and isnull() return a boolean Series of True and False values (so the ==True comparison is actually redundant, though harmless). See the quick illustration after this list.

X_bT: the training features for the bodyType prediction model. That is, we use the other features, which have no missing values, to predict the value of the bodyType column; so in the second line of code we drop the columns that contain many missing values, which are exactly the columns we still need to predict.

ybT_train: the target values we want to predict, of course! The targets must be real numbers rather than NaN, so here we call notnull() to keep only the rows where the original bodyType column is not missing.

XbT_test: easy to understand, this is the test set we build ourselves. All rows where bodyType is empty go into XbT_test, and drop() removes the other missing-value columns still to be predicted, ['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'], leaving only the features needed to predict bodyType.
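
A quick illustration of those boolean masks, on a made-up two-column DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'bodyType': [1.0, np.nan, 3.0], 'power': [60, 75, 90]})
print(df['bodyType'].notnull())            # True, False, True
print(df.loc[df['bodyType'].isnull(), :])  # only the middle row, where bodyType is NaN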

In the same way, the data preprocessing for the other features, fuelType, gearbox, and notRepairedDamage, is done exactly alike. The code is as follows:

# process fuelType
X_fT=data_all.loc[(data_all['fuelType'].notnull()),:]
X_fT=X_fT.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)
yfT_train=data_all.loc[data_all['fuelType'].notnull()==True,'fuelType']
XfT_test=data_all.loc[data_all['fuelType'].isnull()==True,:]
XfT_test=XfT_test.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)

# process gearbox
X_gb=data_all.loc[(data_all['gearbox'].notnull()),:]
X_gb=X_gb.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)
ygb_train=data_all.loc[data_all['gearbox'].notnull()==True,'gearbox']
Xgb_test=data_all.loc[data_all['gearbox'].isnull()==True,:]
Xgb_test=Xgb_test.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)

# process notRepairedDamage
X_nRD=data_all.loc[(data_all['notRepairedDamage'].notnull()),:]
X_nRD=X_nRD.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)
ynRD_train=pd.DataFrame(data_all.loc[data_all['notRepairedDamage'].notnull()==True,'notRepairedDamage']).astype('float64')   # this column was read as non-numeric, hence the cast to float64
XnRD_test=data_all.loc[data_all['notRepairedDamage'].isnull()==True,:]
XnRD_test=XnRD_test.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)
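
Since the same five lines repeat for every column, one possible refactor (my sketch, not the original author's code) wraps the pattern in a helper function:

# a possible refactor (not in the original code): one helper for the repeated pattern
drop_cols = ['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time']

def split_for_imputation(df, target):
    # returns (X_train, y_train, X_test) for predicting `target`'s missing values
    known = df.loc[df[target].notnull(), :]
    X_train = known.drop(drop_cols, axis=1)
    y_train = known[target]
    X_test = df.loc[df[target].isnull(), :].drop(drop_cols, axis=1)
    return X_train, y_train, X_test

# usage, e.g.: X_fT, yfT_train, XfT_test = split_for_imputation(data_all, 'fuelType')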

Interestingly, the preprocessing of the used_time feature differs slightly from the features above:

# process used_time: standardize the column first
# (in scikit-learn >= 0.20, StandardScaler ignores NaNs when fitting and keeps them in transform)
scaler=preprocessing.StandardScaler()
data_all['used_time']=scaler.fit_transform(np.array(data_all['used_time']).reshape(-1, 1))

X_ut=data_all.loc[(data_all['used_time'].notnull()),:]
X_ut=X_ut.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)
yut_train=data_all.loc[data_all['used_time'].notnull()==True,'used_time']
Xut_test=data_all.loc[data_all['used_time'].isnull()==True,:]
Xut_test=Xut_test.drop(['bodyType','fuelType','gearbox','notRepairedDamage','price','used_time'],axis=1)

In the above code, the scaler lines standardize the used_time column (zero mean, unit variance) to make subsequent computation easier.

The numpy reshape() method is also used here. The first argument, -1, means the number of rows is left for numpy to infer (the column count is fixed, so the row count is computed automatically), and the second argument, 1, means the new array has exactly one column. So this converts the used_time data into a single column, which is the shape that sklearn scalers expect.
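
For example:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
print(a.reshape(-1, 1).shape)   # (4, 1): four rows inferred from the data, one column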

scaler.fit() and scaler.fit_transform() are both standard tools packaged in the sklearn library. Briefly:

fit(): simply put, it computes the inherent statistics of the training data, such as its mean, variance, maximum, and minimum.

transform(): on the basis of fit, it performs the actual standardization, dimensionality reduction, normalization, etc. (depending on which tool is used, e.g. PCA or StandardScaler).

fit_transform(): a combination of fit and transform. It first fits the data to obtain its overall statistics (mean, variance, maximum, minimum, and so on), and then transforms the dataset, achieving standardization or normalization in one step.

Both transform() and fit_transform() apply some uniform processing to the data, such as standardizing it to N(0,1), scaling (mapping) it into a fixed interval, normalization, regularization, and so on.
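
A minimal sketch of the difference, on made-up data:

from sklearn.preprocessing import StandardScaler
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test  = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # fit (learn mean/std) + transform in one step
test_scaled  = scaler.transform(test)       # reuse the mean/std learned from train
print(scaler.mean_)                         # [2.]  -- the fitted training mean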

Filling missing values with XGBoost

After processing the data, it is of course time to build our machine learning prediction model!
We first define a prediction function built around xgb.XGBRegressor, which is imported from the xgboost library. As the name suggests, it applies the XGBoost algorithm to regression problems, and it can be used directly just by setting a few parameters. Convenient, isn't it?
The specific prediction code is as follows:

def RFmodel(X_train,y_train,X_test):   # despite the name, this uses XGBoost, not a random forest
    model_xgb= xgb.XGBRegressor(max_depth=4, colsample_bytree=0.1, learning_rate=0.1, n_estimators=32, min_child_weight=2)   # define the model
    model_xgb.fit(X_train,y_train)   # fit on the rows where the target is known
    y_pre=model_xgb.predict(X_test)   # predict the missing values
    return y_pre

The first line inside the function declares an XGBRegressor model object. The second line feeds the training data we prepared into the model for fitting.

The function's return value y_pre is then exactly what we use to fill in the missing values!

Predict missing values for each feature

With the prediction function defined, we can now actually predict the missing values. The inputs here are the datasets we just prepared for each feature; we then assign the predicted values back to the positions of the original missing values, completing the imputation.

y_pred=RFmodel(X_bT,ybT_train,XbT_test)
data_all.loc[data_all['bodyType'].isnull(),'bodyType']=y_pred

y_pred0=RFmodel(X_fT,yfT_train,XfT_test)
data_all.loc[data_all['fuelType'].isnull(),'fuelType']=y_pred0

y_pred1=RFmodel(X_gb,ygb_train,Xgb_test)
data_all.loc[data_all['gearbox'].isnull(),'gearbox']=y_pred1

y_pred2=RFmodel(X_nRD,ynRD_train,XnRD_test)
data_all.loc[data_all['notRepairedDamage'].isnull(),'notRepairedDamage']=y_pred2

y_pred3=RFmodel(X_ut,yut_train,Xut_test)
data_all.loc[data_all['used_time'].isnull(),'used_time']=y_pred3
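
One caveat worth flagging (my note, not part of the original code): bodyType, fuelType, gearbox, and notRepairedDamage are category codes, but XGBRegressor outputs continuous values, so the filled-in entries will generally not be whole numbers. If you want them to stay valid category codes, one option is to round the predictions before assigning them, for example:

# optional tweak (not in the original code): round the regression output back to category codes
data_all.loc[data_all['bodyType'].isnull(), 'bodyType'] = np.round(y_pred)

# re-check the missing counts afterwards
# (price stays NaN on the test rows, since it is the label we ultimately predict)
print(data_all.isnull().sum())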

The feature engineering part ends here. For beginners like most of us, implementing this part demands a solid command of libraries such as pandas and sklearn, so it is common to hit a piece of code that takes a long time to understand, or whose underlying behavior stays unclear even after much effort.

Excellent feature engineering is a great aid to model performance. Although current deep learning puts more emphasis on model structure, perhaps optimizing feature engineering holds great potential in deep learning as well?

Source: blog.csdn.net/zoubaihan/article/details/115342851