A full analysis of the used car transaction price prediction code (1): background introduction and dataset loading

Knowledge gained on paper always feels shallow; to truly understand something, you have to work through it yourself.

Background introduction

In the past two days, with the help of my classmate Q, I finally got started with machine learning and began working on the introductory data mining competition on Tianchi. The first task I ran into was used car transaction price prediction. I originally thought it would be a fairly simple project: I had already used a Keras model for the Boston housing price regression, so surely I could just write a simple neural network, feed in the data, and be done.
So I had always assumed that the basic steps of a deep learning project were:
(1) Data preparation
(2) Model selection & model development
(3) Model evaluation (evaluate)
(4) Model prediction
(5) Model tuning (parameter adjustment)

I never expected that, to predict used car transaction prices accurately, the basic machine learning workflow itself would have to be refined so carefully. Personally, I find machine learning even harder than deep learning; the difficulty is that underneath it all is mathematics. The whole process includes:
(1) Data analysis
(2) Feature engineering
(3) Model training
(4) Model parameter adjustment

Feature engineering involves many data processing methods. Its main purpose is to represent the data in a form that is more useful to the model, so as to improve the performance of the model.

I read "Deep Learning with Python" by François Chollet, the author of Keras; his definition of feature engineering is:

“Feature engineering refers to using your own knowledge about the data and machine learning algorithms (including neural networks) to apply hard-coded transformations (not learned by the model) to the data before feeding it into the model to improve the performance of the model.”

Let's borrow an example from the book to make this concrete:

In most cases, a machine learning model cannot learn from completely arbitrary, unpreprocessed data. For example, suppose you want to read the time from an image of a clock by working out where the hands point. You can convert the position of each hand's tip into (x, y) coordinates, and then a very simple machine learning algorithm can predict the time, with no need to write a complicated image recognition program.
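To make that idea a bit more tangible, here is a tiny sketch (my own toy illustration, not part of the competition code) of such a hand-coded transformation: from (x, y) hand-tip coordinates to angles, and from angles straight to a time.

import math

def clock_from_tips(hour_tip, minute_tip):
    # hour_tip and minute_tip are (x, y) coordinates of the hand tips,
    # with the clock centre at the origin and 12 o'clock pointing up
    hx, hy = hour_tip
    mx, my = minute_tip
    # angle measured clockwise from 12 o'clock, in degrees
    hour_angle = math.degrees(math.atan2(hx, hy)) % 360
    minute_angle = math.degrees(math.atan2(mx, my)) % 360
    hour = int(hour_angle // 30) or 12   # 30 degrees per hour mark
    minute = int(minute_angle // 6)      # 6 degrees per minute mark
    return hour, minute

print(clock_from_tips((0.0, 1.0), (1.0, 0.0)))  # hands at 12 and 3 -> (12, 15)

Once the raw pixels have been reduced to two coordinates per hand, the "learning" part becomes almost trivial, which is exactly the point of feature engineering.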

For me, the hardest part is exactly this feature engineering. Because I am not very familiar with some concepts from mathematical statistics, such as data binning and box plots, I could not understand the code provided by Tianchi even while reading it. This is the first time since I started doing research that I have run into such detailed data processing problems, so I had to take a step-by-step approach and analyze the code line by line from the beginning.

Import related libraries

Don't worry, everyone, the code I posted will definitely work when put together.

The first step is of course to import the necessary packages. Let’s explain each one:

(1) numpy: everyone should be familiar with NumPy. It is Python's matrix computation library; it handles operations between matrices and in many respects feels similar to MATLAB's matrix type.

(2) pandas: pandas is a structured-data toolkit built on top of NumPy; you can simply think of it as one level higher than NumPy. The simplest example is that pandas' DataFrame type carries more structure than a plain NumPy array.

(3) seaborn and matplotlib: these two are Python plotting libraries. seaborn is built on top of matplotlib and offers a higher-level plotting interface; seaborn is to matplotlib roughly what pandas is to NumPy.

(4) missingno: as the name suggests, "missing number". missingno is a missing-value visualization module that lets you see at a glance where data is invalid or missing (there is a small usage sketch right after this list). Quite a handy library, isn't it?

(5) scipy: SciPy is a scientific computing library, mainly covering statistical analysis, linear algebra and calculus. In this code we use scipy.stats, which provides ready-made random variables for all sorts of probability distributions (also shown in the sketch after this list).

(6) time: for handling time, since the raw data contains timestamps that will be processed later. time.time() returns the current time as a floating-point number of seconds since the epoch.

(7) xgboost: a pre-packaged machine learning model that can be called directly during training. More formally, XGBoost is an optimized distributed gradient boosting library; a large number of Kaggle competitors pick XGBoost for data mining competitions (a minimal usage sketch follows the import block below).

(8) sklearn: in the same spirit, scikit-learn is a machine learning toolkit for Python, built on NumPy, SciPy and matplotlib. Many machine learning models and utilities are already implemented and ready to call, which makes it very friendly to newcomers like me.
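As promised in items (4) and (5), here is a minimal, purely illustrative sketch of missingno and scipy.stats on a made-up toy table (the column names 'power' and 'price' are just stand-ins, not the real competition fields):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
from scipy import stats

# a tiny made-up table with a few missing values
toy = pd.DataFrame({'power': [75, 150, np.nan, 110, 231],
                    'price': [1850, 3600, 6222, np.nan, 5200]})

msno.matrix(toy)   # matrix view: blank gaps mark the missing entries
msno.bar(toy)      # bar view: how many non-null values each column has
plt.show()

# scipy.stats bundles many probability distributions; for example,
# fit a normal distribution to the non-missing prices
mu, sigma = stats.norm.fit(toy['price'].dropna())
print(mu, sigma)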

import numpy as np
import pandas as pd
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
from scipy import stats
import time
import xgboost as xgb
from sklearn.metrics import roc_auc_score,mean_absolute_error
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import cross_val_score,train_test_split
import sklearn.preprocessing as preprocessing

You can also see from the code above that the names imported from sklearn are machine learning tools/models/components, for example RandomForestClassifier (random forest) and mean_absolute_error, the familiar MAE (mean absolute error, a commonly used evaluation metric for regression problems). A quick sketch of xgboost together with MAE follows.
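Here is a minimal sketch, on random stand-in data rather than the real car data, of how XGBoost's scikit-learn-style regressor and mean_absolute_error are typically used together (the hyperparameter values are arbitrary):

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# random features and targets standing in for the real used car data
X = np.random.rand(500, 10)
y = np.random.rand(500) * 10000

x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# XGBRegressor follows the familiar sklearn fit/predict interface
model = xgb.XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
model.fit(x_train, y_train)
pred = model.predict(x_val)
print('MAE:', mean_absolute_error(y_val, pred))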

Load dataset

Next is data loading.
Use pandas to load the datasets provided by Tianchi, then concatenate the training set data_train and the test set data_test so that later processing can be applied to both in one go.

data_train=pd.DataFrame(pd.read_csv(r'F:/kaggle天池/used_car_train_20200313.csv',sep=' '))
data_test=pd.DataFrame(pd.read_csv(r'F:/kaggle天池/used_car_testB_20200421.csv',sep=' '))
data_train['type']='train'  # add a 'type' column so train and test rows can be told apart later
data_test['type']='test'
print(data_train.info(verbose=True,null_counts=True))  # note: newer pandas renamed null_counts to show_counts
print(data_test.info(verbose=True,null_counts=True))
print(data_train.head())  # print the first few rows of the table
print(data_test.head())
print(data_train.shape)  # check the shape of the matrix
print(data_test.shape)
data_all=pd.concat([data_train,data_test],axis=0,ignore_index=True,sort=False)  # concatenate train and test row-wise

Here are several important methods:
head(): what do you do when a table is large but you only want a rough look at it? Use head(). By default head() prints the first five rows of the table, which is enough to get a general feel for the data. If you want the first ten rows, just write head(10).
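For example, on the data_train loaded above (a sketch of the calls, not their actual output):

print(data_train.head())    # first 5 rows by default
print(data_train.head(10))  # first 10 rows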

info(): this function prints a concise summary of the DataFrame: the dtype of the index and of each column, the number of non-null values, and the memory usage. The verbose parameter (literally "wordy") decides whether the full per-column summary is printed.
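Again a small sketch on data_train, just to show the effect of verbose:

data_train.info(verbose=True)   # full summary: one line per column
data_train.info(verbose=False)  # short summary: column count, dtypes and memory only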

pd.concat(): this function joins two matrices together. The axis argument specifies the dimension along which to join: axis=0 stacks the two matrices vertically (appending rows downward), while axis=1 joins them horizontally (appending columns to the right).
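A toy sketch of the difference (the frames a and b are made up):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

print(pd.concat([a, b], axis=0, ignore_index=True))  # 4 rows, 1 column: stacked vertically
print(pd.concat([a, b], axis=1))                     # 2 rows, 2 columns: placed side by side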

pd.DataFrame(): a pandas constructor that converts a matrix (array) into pandas' own DataFrame format. The biggest difference between a DataFrame and a plain array is that a DataFrame carries labels (which can be strings or numbers): index is the row label, columns is the column label, and values holds the actual numbers. I will come back to how to use DataFrame in detail later. A plain numpy array, by contrast, is nothing but the bare numbers. So the most intuitive difference is that a DataFrame has headers, which makes working with Excel or CSV files very convenient and intuitive.
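A quick sketch of the contrast (made-up values):

import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)   # a bare numpy array: just the numbers, no labels

df = pd.DataFrame(arr, index=['row1', 'row2'], columns=['a', 'b', 'c'])
print(df)            # same numbers, but with named rows and columns
print(df.index)      # the row labels
print(df.columns)    # the column labels
print(df.values)     # the underlying numpy array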
