Full analysis of second-hand car transaction price prediction code (2): Data analysis and feature engineering

When faced with failure, do not be discouraged; when faced with choices, do not hesitate; when faced with challenges, do not be afraid.

View missing and duplicate values

First, let's talk about what missing values are. For example, a column that clearly should contain a number has nothing at all in some cells. This causes errors at runtime: the program complains that it cannot convert NaN (Not a Number).

There are a lot of missing values in the data set provided by Tianchi. When I first started running the program, NaN errors were reported everywhere, so it is necessary to check for them. The following code checks the missing values and the value distributions:

missing = data_all.isnull().sum()   # number of NaNs in each column
missing = missing[missing > 0]      # keep only the columns that actually have missing values
print(missing)
print(data_all['bodyType'].value_counts())   # value distributions of the sparse columns
print(data_all['fuelType'].value_counts())
print(data_all['gearbox'].value_counts())

Let me explain isnull().sum() and value_counts() here.
isnull(): returns a Boolean matrix (True or False) indicating whether the corresponding position in the original matrix is a NaN missing value.

isnull().sum(): sum() accumulates over the Boolean matrix returned by isnull(). In the underlying implementation, True == 1 and False == 0, so every missing value contributes a 1: if a column has Q missing values, the sum for that column is Q*1 = Q. It therefore tells us directly how many missing values each column has.

value_counts(): shows the distinct values in a column and how many times each one occurs. The returned object is a pandas Series, which behaves much like a list of value/count pairs.
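
To make both functions concrete, here is a minimal, self-contained sketch on a toy DataFrame (the values are invented for illustration):

import pandas as pd
import numpy as np

toy = pd.DataFrame({'bodyType': [1.0, np.nan, 1.0, 3.0],
                    'gearbox':  [0.0, 1.0, np.nan, np.nan]})

print(toy.isnull().sum())               # bodyType: 1, gearbox: 2
print(toy['bodyType'].value_counts())   # 1.0 occurs twice, 3.0 once (NaN is skipped)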

Data analysis of numerical features

A numerical feature is one whose value is a concrete number, as opposed to a categorical feature (where the number stands for a category, for example 0 meaning good quality and 1 meaning inferior quality). Besides these, feature engineering also deals with time features, text features, and so on.

In the code below, num_features selects the numerical features, including power, price, and the anonymous numerical features v_0 to v_14. Correspondingly, categorical_features (used later) holds the categorical features, including name, brand, and so on.

The job of this code is to print each column's distinct values and the number of unique values, giving us a look at the raw data set. It saves us from opening Excel and scanning line by line, and makes the subsequent steps easier:

num_features = ['kilometer', 'power', 'price', 'v_0',
                'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
                'v_11', 'v_12', 'v_13', 'v_14']

for num in num_features:
    print('{} feature has {} distinct values'.format(num, data_all[num].nunique()))
    temp = data_all[num].value_counts()
    print(temp)
    print(pd.DataFrame(data_all[num]).describe())   # summary statistics of the column

In the above code, the describe() function computes the summary statistics of a data column (count, mean, standard deviation, min, quartiles, max). Here it serves as a quick look at the basic content of the matrix. For the detailed parameters of describe(), see:

https://blog.csdn.net/m0_45210226/article/details/108942526
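
For a quick taste of what describe() prints, a toy example (values invented):

import pandas as pd

s = pd.Series([1, 2, 2, 10])
print(s.describe())   # count 4, mean 3.75, plus std, min, 25%/50%/75% quartiles and max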

In addition, this part of the code uses a new function: nunique().

In English, "unique" means sole or only, so nunique is "n unique", i.e. the number of unique values.

Let's add a bit of pandas background. pandas has two related methods, unique() and nunique():
(1) unique() returns all the distinct values of the selected column (all unique values of the feature) as an array (numpy.ndarray).
(2) nunique() returns the number of distinct values.
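
A minimal sketch of the difference (toy values):

import pandas as pd

s = pd.Series([3, 1, 3, 7])
print(s.unique())    # array([3, 1, 7]): the distinct values, in order of appearance
print(s.nunique())   # 3: just the count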

Data analysis of categorical features

After the numerical features, the categorical features are naturally next!

Here categorical_features holds the categorical features, including name, brand, and so on.

This code mirrors the numerical-feature code above; its purpose is to observe the basic situation of the categorical feature data:

categorical_features = ['model', 'name', 'brand', 'notRepairedDamage', 'bodyType', 'fuelType', 'gearbox', 'regionCode']
for cat in categorical_features:
    print('{} feature has {} distinct values'.format(cat, data_all[cat].nunique()))
    temp = data_all[cat].value_counts()
    print(temp)
    print(pd.DataFrame(data_all[cat]).describe())

Clean up abnormal values

The data set provided by Tianchi is genuinely messy. It contains not only NaN but also special characters such as "-", and a single "-" turns a perfectly good numeric column into a string column. Infuriating, and the data cannot be analyzed in that state.

When classmate Q and I ran the code together, we tore our hair out over this for ages. At one point I even tried brute-forcing the "-" symbols away in Excel, but Excel's find-and-replace was far too slow. Orz

Here we can first run one line of code to check the data types:

data_all.info()   # info() prints its report directly, so there is no need to wrap it in print()

The output looks like this:

[Figure: output of data_all.info(), listing each column's non-null count and dtype]

You can see how messy the notRepairedDamage column is! Every other column is a numerical type (float or int); it alone is of type object!

A machine learning model operates on a numerical matrix, so an object-type column cannot be parsed and has to be converted.

Our approach here is to first replace every "-" with NaN, and then fill all the NaNs in one go later.

So the code to clean up the "-" abnormal values is as follows:

data_all['notRepairedDamage'].replace('-', np.nan, inplace=True)   # requires import numpy as np

replace(): replaces occurrences of the first argument in the object with the second argument; here "-" becomes NaN. The function has a parameter inplace, meaning "change in place". If it is False, the original object is left untouched and a modified copy is returned; if it is True, the original object is modified directly. That is what we want here, so we set it to True.
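
The NaN-filling step itself is not shown in this part of the post; here is a minimal sketch of both points (filling with the column's most frequent value is my choice for illustration, not necessarily the author's):

# inplace=False (the default) leaves data_all untouched and returns a modified copy
cleaned = data_all['notRepairedDamage'].replace('-', np.nan)
# one common way to fill the NaNs afterwards: use the mode (most frequent value)
data_all['notRepairedDamage'] = cleaned.fillna(cleaned.mode()[0]).astype(float)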

Remove severely skewed features

According to the Tianchi Datawhale tutorial, the seller and offerType columns are severely skewed (almost every row carries the same value), which generally does not help prediction, so these two columns can be deleted.

The drop() method deletes specified rows or columns from a DataFrame object; list the labels to delete in the first argument. With axis=1 columns are deleted; with axis=0 rows are deleted.
The specific deletion code is as follows:

# drop the two useless columns
data_all = data_all.drop(['seller', 'offerType'], axis=1)

# box plot of each categorical feature's value counts (training rows only)
for cat in categorical_features:
    data_all[(data_all['type'] == 'train')][cat].value_counts().plot(kind='box')
    plt.show()

# histogram with a fitted normal curve for each numerical feature
# (requires import seaborn as sns and from scipy import stats)
for num in num_features:
    sns.distplot(data_train[num], kde=False, fit=stats.norm)
    plt.show()

This part of the code contains two plotting loops. The first draws a box plot of each categorical feature's value counts; the second draws the distribution of each numerical feature, so we can observe how the data is spread out.

sns.distplot() integrates matplotlib's hist() with sns.kdeplot(): it draws a histogram plus a kernel density curve (here kde=False suppresses the density curve and fit=stats.norm overlays a fitted normal distribution instead). At this stage we only need to understand how it is applied.
Details can be viewed at:

https://blog.csdn.net/pythonxiaopeng/article/details/109642444

Two of the resulting plots:

[Figure: box plot of the brand value counts]

The picture above is the distribution for brand; the spread looks quite large. The picture below is the notRepairedDamage column we just cleaned. It looks much friendlier, doesn't it?

[Figure: distribution of notRepairedDamage after cleaning]

View feature distribution

This part calls sns.kdeplot to check the distribution of our main features.
sns.kdeplot() draws a smoothed estimate of a variable's distribution; overlaying two such curves lets us compare how the same feature is distributed in two data sets, which is exactly what the loop below does for the training set and the test set. The technique has a formal name: kernel density estimation (KDE).

The sns here is simply the result of import seaborn as sns. seaborn is a Python graphics library like matplotlib and is likewise used for plotting; in fact it is a higher-level API wrapper built on top of matplotlib, which makes drawing easier.
The code to plot the feature distribution is as follows:

feature=[ 'name',  'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
        'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9',
        'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
for i in feature:
    g=sns.kdeplot(data=data_all[i][(data_all['type']=='train')],color='Red',shade=True)
    g = sns.kdeplot(data=data_all[i][(data_all['type'] == 'test')],ax=g, color='Blue', shade=True)
    g.set_xlabel(i)
    g.set_ylabel("Frequency")   # the y-axis is the estimated density; its exact scale comes from the kernel density estimation mathematics
    g = g.legend(["train", "test"])
    plt.show()

Here are the results I drew. Each figure overlays the basic distribution of a feature in the training set and in the test set, which makes the comparison clear at a glance.

[Figure: train vs test kernel density estimate for name]

The picture above is the feature distribution of name, and the one below is the feature distribution of brand (trademark). These images are auxiliary observations; a quick look is enough.

[Figure: train vs test kernel density estimate for brand]

Analyze the correlation of features and filter out those with low correlation

This part of the code performs feature screening; it computes correlations.
The corr() function measures the direction and strength of the relationship between two variables. Its value ranges from -1 to +1: 0 means the two variables are uncorrelated, a positive value means positive correlation, a negative value means negative correlation, and the larger the absolute value, the stronger the correlation.

The method parameter accepts 'pearson', 'spearman', or 'kendall'. The Spearman correlation coefficient is used here. Which coefficient to choose is a statistical question; for now let's focus on the application.
The correlation calculation code is as follows:

feature=['price','name',  'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
        'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9',
        'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
corr = data_all[feature].corr(method='spearman')
corr = pd.DataFrame(corr)   # wrap the correlation matrix as a pandas DataFrame
# (seaborn is built on pandas, so it expects pandas data types;
# corr() in fact already returns a DataFrame, so this line is just being explicit)

sns.heatmap(corr, fmt='0.2f')   # fmt is the number format of the cell annotations (shown only with annot=True)
plt.show()
# drop the feature with low correlation
data_all = data_all.drop(['regionCode'], axis=1)

This program uses sns to draw a heatmap.
As for what a heatmap is, the drawn result should make it obvious:

[Figure: Spearman correlation heatmap of the selected features]

You can see at a glance that the correlation between regionCode (region code) and every other factor is low (nearly 0 across the board), so in this code we delete that column with drop.

View time features

Time features are also very important, because later we will subtract two time columns to compute the car's service life. Here we call pd.to_datetime(), which converts str and unicode values into datetime format. The times in the original data set are plain strings such as 20210330 rather than the 2021-03-30 we want, and pd.to_datetime() performs this string-to-date conversion for us automatically!

The specific date conversion code is as follows:

data_all['regDate'] = pd.to_datetime(data_all['regDate'], format='%Y%m%d', errors='coerce')
data_all['creatDate'] = pd.to_datetime(data_all['creatDate'], format='%Y%m%d', errors='coerce')
print(data_all['regDate'].isnull().sum())
# some regDate entries are malformed, e.g. 20070009, but there is no month 00;
# hence errors='coerce': on a parse failure the value becomes NaT (the datetime NaN),
# which is convenient for filling in later
print(data_all['creatDate'].isnull().sum())   # count how many rows failed to parse

Next comes the more critical step, and a classic move in feature-engineering data processing:

data_all['used_time'] = (data_all['creatDate'] - data_all['regDate']).dt.days
data_all = data_all.drop(['SaleID', 'regDate', 'creatDate', 'type'], axis=1)   # drop columns we no longer need
# used_time = creatDate - regDate reflects how long the car has been in use;
# generally, price is inversely related to usage time
# (bear in mind that some dates were malformed and are now NaT, so used_time can be NaN)

Here we construct a new used_time column. As the name suggests, it is the car's usage time; it distils the information of the original creatDate (listing time) and regDate (registration time).

View brand and price features

This part of the code is simpler than the Tianchi Datawhale version. It groups the data by the brand and model columns and then takes averages. Let me explain what the code means:

The groupby() function gathers the rows that share the same value in the specified column into a group, so what it returns is organized group by group. The code below averages the prices of cars of the same brand, and likewise computes the mean and median prices of cars of the same model.

The mean() function takes the average; by default it averages down each column. To average across rows, use mean(1), i.e. axis=1.

median() finds the median; its usage is similar to mean().

brand_and_price_mean = data_all.groupby('brand')['price'].mean()
model_and_price_mean = data_all.groupby('model')['price'].mean()
brand_and_price_median = data_all.groupby('brand')['price'].median()
model_and_price_median = data_all.groupby('model')['price'].median()
data_all['brand_and_price_mean'] = data_all.loc[:, 'brand'].map(brand_and_price_mean)
data_all['model_and_price_mean'] = data_all.loc[:, 'model'].map(model_and_price_mean).fillna(model_and_price_mean.mean())
# the median columns should map the median Series (the original post mapped the mean Series here, apparently by mistake)
data_all['brand_and_price_median'] = data_all.loc[:, 'brand'].map(brand_and_price_median)
data_all['model_and_price_median'] = data_all.loc[:, 'model'].map(model_and_price_median).fillna(model_and_price_median.mean())

The last four lines use a clever pandas function, map(). Its job is to reassign each value in the calling object according to a set of key-value pairs (key: value, like a dictionary); the professional term is a mapping. A borrowed picture expresses the basic usage:

[Figure: illustration of Series.map() looking up each value in a dictionary to get its new value]

You can see that map() essentially supplies a dictionary; the caller looks up each of its own values in it and gets the corresponding new value.
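
A minimal sketch of map() with a Series and with a plain dict as the lookup table (toy values):

import pandas as pd

brands = pd.Series([0, 1, 0, 2])
avg_price = pd.Series({0: 5000.0, 1: 8000.0, 2: 3000.0})   # shaped like a groupby('brand')['price'].mean() result
print(brands.map(avg_price))                          # 5000.0, 8000.0, 5000.0, 3000.0
print(brands.map({0: 'low', 1: 'high', 2: 'low'}))    # a dict works the same way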

It is therefore not hard to understand that in this code the new columns brand_and_price_mean, model_and_price_mean, brand_and_price_median, and model_and_price_median assign the same value to every car of the same brand (or model): the group-level price statistic of that brand or model.

Personally, I see this as integrating features and constructing some new ones, which matches the usual construction directions in feature engineering (a small sketch follows the list):

  • Construct statistical features: counts, sums, proportions, standard deviations, and so on;
  • Time features, including relative time and absolute time, holidays, weekends, etc.;
  • Geographic information, including binning, distribution encoding and other methods;
  • Nonlinear transformations, including log/square/square-root, etc.;
  • Feature combinations, feature crosses;
  • And whatever else the situation calls for: different people will construct different features, according to the actual data.
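
As a tiny illustration of the first and fourth directions (a sketch only; the derived columns brand_count and power_log are my own names, not from the original code):

import numpy as np

# statistical feature: how many listings share this car's brand
data_all['brand_count'] = data_all['brand'].map(data_all['brand'].value_counts())
# nonlinear transform: log(1 + power) tames the long tail of the power column
data_all['power_log'] = np.log1p(data_all['power'])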

Origin: blog.csdn.net/zoubaihan/article/details/115322611