Used Car Transaction Forecast
Experimental purpose and requirements
- Combined with problem understanding, describe three models applicable to this problem.
- Master the basic process of data mining, including data analysis and preprocessing, feature engineering, and model training and testing. Reference steps are given in the experiment content file; you may follow them or complete each stage in your own way.
- Finally, upload the prediction result file to the competition website for testing, take a screenshot of the result, and record the score and ranking.
Test environment
In this experiment, a PC is used for data analysis, and a server is used to train the model and make predictions.
[PC Configuration]
CPU: 11th Gen Intel® Core™ i7-11700K @ 3.60GHz
GPU: NVIDIA GeForce RTX 3060
Operating System: Windows 10 Professional Edition
[Server configuration]
CPU: Intel® Xeon® CPU E5-2680 v4 @ 2.40GHz
GPU: 2× NVIDIA A100
Operating system: Linux
Experimental content and process
1. Data analysis
1. Basic data analysis:
First import the necessary libraries, then merge the training set and test set so that subsequent preprocessing can be applied to both at once.
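The report does not show the merging code; a minimal sketch of the idea (the `is_train` flag name is an assumption, not from the report) might look like:

```python
import pandas as pd

def merge_train_test(train, test):
    """Concatenate the training and test sets so that preprocessing is
    applied to both at once; an is_train flag lets us split them back."""
    train, test = train.copy(), test.copy()
    train["is_train"] = 1
    test["is_train"] = 0
    return pd.concat([train, test], ignore_index=True, sort=False)
```

Columns present only in the training set (such as `price`) simply become NaN on the test rows.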
View the first few rows of data:
Analysis of the data and the competition documentation shows that it contains 31 fields, 15 of which are anonymous variables (v_0 through v_14); the remaining fields are described below:
Field name | Meaning |
---|---|
SaleID | Transaction ID, unique code |
name | Car trade name, desensitized |
regDate | Date of car registration, e.g. 20160101 for January 1, 2016 |
model | Model code, desensitized |
brand | Car brand, desensitized |
bodyType | Body type: Luxury sedan: 0, Mini: 1, Van: 2, Bus: 3, Convertible: 4, Two-door car: 5, Business vehicle: 6, Mixer truck: 7 |
fuelType | Fuel Type: Petrol: 0, Diesel: 1, LPG: 2, Natural Gas: 3, Hybrid: 4, Other: 5, Electric: 6 |
gearbox | Transmission: Manual: 0, Automatic: 1 |
power | Engine Power: Range [ 0, 600 ] |
kilometer | Kilometers traveled by the car, unit 10,000 km |
notRepairedDamage | Car has unrepaired damage: Yes: 0, No: 1 |
regionCode | Area code, desensitized |
seller | Seller: Individual: 0, Non-Individual: 1 |
offerType | Offer Types: Offers: 0, Requests: 1 |
creatDate | The time when the car goes online, that is, the time when the sale starts |
price | Used car transaction price (forecast target) |
First, take a rough look at the data:
Use describe() to get a statistical overview of the data.
Use info() to inspect the data types.
All features except notRepairedDamage are already numeric, so that feature will need to be converted to a numeric type later.
2. Feature correlation analysis:
Plot a heatmap of feature correlations. The features most strongly correlated with price are regDate and the anonymous features v_0, v_3, v_8, and v_12; these will be handled in feature engineering.
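The heatmap itself is a plot, but the underlying numbers can be computed directly. A sketch (function name is mine, not the report's) that ranks features by absolute correlation with the target:

```python
import pandas as pd

def top_price_correlations(df, target="price", k=5):
    """Absolute Pearson correlation of every numeric feature with the
    target, sorted descending -- a numeric counterpart to the heatmap."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k)
```

Passing the result to `seaborn.heatmap(df.corr())` gives the visual version described above.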
3. Value-repetition analysis:
Count the number of unique values in each feature. Features whose values repeat heavily are good candidates for filling missing values with the mode later on.
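A one-line sketch of this count (helper name is mine):

```python
def unique_value_counts(df):
    """Number of distinct values per column, ascending: columns near the
    top are highly repetitive and good candidates for mode imputation."""
    return df.nunique().sort_values()
```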
4. Statistical analysis of missing values:
Next, count the missing and abnormal values:
The fields bodyType, fuelType, and gearbox have a large number of missing values.
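The missing-value count can be sketched as follows (the report does not show its exact code):

```python
import pandas as pd

def missing_report(df):
    """Count and ratio of missing values per column, largest first;
    columns with no missing values are dropped from the report."""
    n = df.isnull().sum()
    rep = pd.DataFrame({"missing": n, "ratio": n / len(df)})
    return rep[rep["missing"] > 0].sort_values("missing", ascending=False)
```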
5. Price distribution analysis:
Plotting the histogram and density of price shows a long-tailed distribution, so the target should be transformed (for example, log-transformed) before training.
After the transformation:
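The report calls this step "normalization" without showing code; a common way to tame a long-tailed target, sketched here as an assumption about what was done, is a log transform:

```python
import numpy as np

def transform_price(price):
    """log1p compresses the long right tail of the price distribution;
    predictions are mapped back with np.expm1 before submission."""
    return np.log1p(price)
```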
2. Data preprocessing
1. Missing value processing:
The repetition statistics above show that most features with missing values have highly repetitive values, so the missing values are filled with each feature's mode.
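A sketch of mode filling (helper name is mine):

```python
def fill_with_mode(df, cols):
    """Fill missing values in the given columns with each column's mode
    (the most frequent value)."""
    df = df.copy()
    for c in cols:
        df[c] = df[c].fillna(df[c].mode().iloc[0])
    return df
```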
2. Converting non-numeric features
The only non-numeric feature is notRepairedDamage, which takes three values: '-', '0.0', and '1.0'. The placeholder '-' is replaced with '0.0', and the feature is cast to float.
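This conversion, as described in the text, can be sketched as:

```python
def clean_not_repaired_damage(df):
    """Map the '-' placeholder to '0.0' and cast the column to float,
    as the text above describes."""
    df = df.copy()
    df["notRepairedDamage"] = (
        df["notRepairedDamage"].replace("-", "0.0").astype(float)
    )
    return df
```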
3. Outlier processing
The power feature should lie between 1 and 600, so values outside this range are clipped back into it.
The values of the anonymous features v_13 and v_14 are concentrated in a narrow range, so the few outliers beyond it are clipped as well.
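Both truncations amount to clipping columns to a range; a sketch (helper name and the dictionary interface are mine):

```python
def clip_outliers(df, bounds):
    """Truncate each listed column to its [low, high] range,
    e.g. bounds = {'power': (1, 600)}."""
    df = df.copy()
    for col, (lo, hi) in bounds.items():
        df[col] = df[col].clip(lo, hi)
    return df
```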
3. Feature engineering
1. Combining anonymous features:
Several anonymous features correlate strongly with price, which suggests they carry important information. New features are constructed by combining the anonymous features with each other and with the other features, for later use and analysis.
2. Extracting date information
The date information consists of two features, regDate and creatDate. A parsing function date_tran() splits each date into year, month, and day.
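The report names date_tran() but does not show its body. A plausible sketch (the handling of '00' months is my assumption; this dataset is known to contain such values, but the report does not say what it does with them):

```python
def date_tran(value):
    """Split a date like 20160101 into (year, month, day).
    ASSUMPTION: months recorded as '00' are mapped to 1."""
    s = str(int(value)).zfill(8)
    year, month, day = int(s[:4]), int(s[4:6]), int(s[6:8])
    if month == 0:
        month = 1
    return year, month, day
```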
3. Count encoding of features
For selected features, count how often each value occurs and add these counts as new features. The function count_coding() implements this count encoding.
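The report names count_coding() without showing it; a sketch of the technique:

```python
def count_coding(df, cols):
    """For each listed column, add '<col>_count': how often that row's
    value occurs in the whole column."""
    df = df.copy()
    for c in cols:
        df[c + "_count"] = df[c].map(df[c].value_counts())
    return df
```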
4. Constructing features from dates
Based on the registration date and creation date, construct more meaningful features, such as the age of the car and the number of days from each date to the present. Since the resulting day counts take many discrete values, they are bucketed with the function cut_group().
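The report names cut_group() but shows neither it nor the age computation; a sketch of both (add_car_age and the equal-width binning are my assumptions):

```python
import pandas as pd

def add_car_age(df):
    """Days between registration and listing -- a proxy for the car's age."""
    df = df.copy()
    reg = pd.to_datetime(df["regDate"].astype(str), format="%Y%m%d", errors="coerce")
    creat = pd.to_datetime(df["creatDate"].astype(str), format="%Y%m%d", errors="coerce")
    df["used_days"] = (creat - reg).dt.days
    return df

def cut_group(df, col, num_bins=50):
    """Bucket a continuous column into equal-width bins (integer labels)."""
    df = df.copy()
    df[col + "_bin"] = pd.cut(df[col], num_bins, labels=False)
    return df
```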
5. Feature crossing
Describe the categorical features with statistics of the numerical features; here, the anonymous features most correlated with price are used. The function cross_cat_num() performs this first-order cross.
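The report names cross_cat_num() without showing it; a sketch using per-category max/min/mean (the choice of statistics is my assumption):

```python
def cross_cat_num(df, cat_cols, num_cols):
    """First-order cross: per-category max/min/mean of each numeric
    column, merged back onto every row as new features."""
    df = df.copy()
    for cat in cat_cols:
        for num in num_cols:
            stats = df.groupby(cat)[num].agg(["max", "min", "mean"])
            stats.columns = [f"{cat}_{num}_{s}" for s in ["max", "min", "mean"]]
            df = df.merge(stats.reset_index(), on=cat, how="left")
    return df
```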
6. Feature encoding
Categorical features such as the car name, brand, and region code are high-cardinality, so they need to be encoded carefully. One-hot encoding and label encoding are the common choices, but they are too simple to work well here and tend to consume a lot of memory and training time. Instead, two methods better suited to high-cardinality categorical features are used: mean encoding and target encoding.
① Mean encoding
Define a mean-encoding class, declare the features to encode, fit it on the training features and labels, and then encode the test set.
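The report's mean-encoding class is not shown. A common leakage-aware formulation, sketched here as a function rather than a class (the out-of-fold scheme is my assumption about how such a class typically works):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def mean_encode(train, test, col, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding: each training row gets the target mean
    computed on the *other* folds, limiting target leakage; the test set
    uses means from the full training set."""
    train, test = train.copy(), test.copy()
    global_mean = train[target].mean()
    enc = np.full(len(train), global_mean)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in kf.split(train):
        fold_means = train.iloc[tr_idx].groupby(col)[target].mean()
        enc[va_idx] = (
            train.iloc[va_idx][col].map(fold_means).fillna(global_mean).to_numpy()
        )
    train[col + "_mean_enc"] = enc
    full_means = train.groupby(col)[target].mean()
    test[col + "_mean_enc"] = test[col].map(full_means).fillna(global_mean)
    return train, test
```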
② Target encoding
First set a default value for each encoding; three statistics are used: maximum, minimum, and mean. Each is combined with the categorical features to produce target-encoded features.
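A simplified sketch of this max/min/mean target encoding (no out-of-fold scheme, so some leakage risk remains; the function name is mine):

```python
def target_encode(train, test, col, target, stats=("max", "min", "mean")):
    """Per-category max/min/mean of the target, with the global statistic
    as the default for categories unseen in training."""
    train, test = train.copy(), test.copy()
    for s in stats:
        mapping = train.groupby(col)[target].agg(s)
        default = train[target].agg(s)  # the "default value" for unseen keys
        name = f"{col}_{target}_{s}"
        train[name] = train[col].map(mapping)
        test[name] = test[col].map(mapping).fillna(default)
    return train, test
```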
③ Feature normalization
Merge the training set and test set, and use MinMaxScaler to normalize the data.
④ Feature dimensionality reduction
Use PCA to reduce the dimensionality of the features, and finally assign the processed data back to the training set and test set.
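Steps ③ and ④ together can be sketched with scikit-learn (the wrapper function is mine; the report uses MinMaxScaler and PCA as stated):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def scale_and_reduce(X, n_components):
    """Min-max scale all features to [0, 1], then project onto the top
    principal components."""
    X_scaled = MinMaxScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(X_scaled)
```

In practice the scaler and PCA should be fitted on the merged data exactly as the text describes, then the rows assigned back to the training and test sets.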
4. Prediction Model
1. Problem analysis
This task, predicting used-car transaction prices, is a typical supervised regression problem, so it can be handled with regression models, ensemble methods, and similar algorithms.
2. Available models
For this problem, models such as LightGBM, CatBoost, and neural networks can be used for training and prediction. LightGBM and CatBoost are gradient-boosted tree models; their advantage is fast training convergence, with regularization coefficients and learning rate adjustable during training, while their disadvantage is high memory consumption and long training times. In this experiment I used a neural network: it trains well on modestly sized data, converges quickly, and its accuracy approaches that of the tree models. In addition, the Adam optimizer reduces memory consumption, improves computational efficiency, and handles the features well.
3. Training model and prediction
① Define the neural-network constructor; a six-layer structure is used here.
② Define a learning-rate schedule: every 100 epochs, the learning rate is reduced to 1/10 of its previous value.
③ Train and predict. Regularization is applied during training to prevent overfitting, the training error is monitored, and the learning rate is adjusted as training progresses. Six-fold cross-validation further reduces overfitting, the Adam optimizer drives gradient descent, and the predictions with the smallest loss are saved as the submission file.
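The report's network code is not shown. The sketch below captures steps ② and ③ using scikit-learn's MLPRegressor (Adam optimizer, L2 penalty via `alpha`) as a stand-in for the report's hand-built network; the function names, hyperparameter values, and the use of MLPRegressor itself are my assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

def lr_schedule(epoch, base_lr=1e-3):
    """Divide the learning rate by 10 every 100 epochs, as in step 2."""
    return base_lr * 0.1 ** (epoch // 100)

def cv_train_predict(X, y, X_test, n_splits=6, seed=0):
    """K-fold training loop in the spirit of step 3: train one model per
    fold and average the test-set predictions across folds."""
    preds = np.zeros(len(X_test))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in kf.split(X):
        model = MLPRegressor(hidden_layer_sizes=(300, 64, 32, 8),
                             solver="adam", alpha=1e-4,   # L2 regularization
                             learning_rate_init=1e-3,
                             max_iter=200, random_state=seed)
        model.fit(X[tr_idx], y[tr_idx])
        preds += model.predict(X_test) / n_splits
    return preds
```

In the report's own setup, validation loss is tracked per fold and the lowest-loss predictions are written to the submission file.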
5. Adjust parameters and test
The model has parameters such as the number of epochs, the batch size, the learning rate, and the number of network layers. To obtain better results, I consulted references and tuned the parameters many times. Since the number of submissions was limited, I submitted only the few configurations with the smallest local loss for scoring.
6. Final submission results:
After a series of adjustments, the best leaderboard result came from a batch size of 2000, 145 epochs, and a five-layer network with layer sizes of 300, 64, 32, 8, and 1. Submitting the final predictions to the competition website, my score was 400.3377, ranking 119th.
Experimental takeaways
In this experiment, I took part in the "Used Car Price Prediction" competition hosted by Ali Tianchi. By consulting materials and books, I learned to independently perform data analysis and data cleaning and to choose an appropriate model for prediction, which gave me a more comprehensive understanding of data mining. I also mastered several common data analysis and cleaning methods and learned, understood, and practiced commonly used prediction models.
In addition, I initially had no good intuition for the parameters used in this experiment (learning rate, batch size, epochs, etc.). I tuned them manually and submitted the better results several times to climb the leaderboard, finally achieving a good result.