Introduction to Data Mining - Comprehensive Lab

Experimental purpose and requirements

  1. Based on your understanding of the problem, describe three models applicable to it.
  2. Master the basic process of data mining, including data analysis and preprocessing, feature engineering, and model training and testing. Some reference steps are given in the experiment content file; you may follow them or improvise at each stage.
  3. Finally, upload the prediction result file to the competition website for testing, take a screenshot of the result, and record the score and ranking.

Test environment

In this experiment, a PC is used for data analysis, and a server is used to train the model and make predictions.

[PC Configuration]
CPU: 11th Gen Intel® Core™ i7-11700K @ 3.60 GHz
GPU: NVIDIA GeForce RTX 3060
Operating System: Windows 10 Professional Edition

[Server configuration]
CPU: Intel® Xeon® CPU E5-2680 v4 @ 2.40GHz
GPU: 2 × NVIDIA A100
Operating system: Linux

Experimental content and process

1. Data analysis

1. Basic data analysis:

First introduce the necessary libraries, and then merge the training set data and test set data for subsequent operations.

View the first few rows of data:
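A minimal sketch of this step; the file names follow the Tianchi dataset convention and are assumptions about the local setup:

```python
import numpy as np
import pandas as pd

# The Tianchi files are space-separated; adjust paths if your copies differ.
train = pd.read_csv('used_car_train_20200313.csv', sep=' ')
test = pd.read_csv('used_car_testB_20200421.csv', sep=' ')

# Merge train and test so later preprocessing is applied consistently to both.
data = pd.concat([train, test], ignore_index=True)
print(data.head())
```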
Analysis and consulting the documentation show that the data contains 31 fields, 15 of which are anonymous variables (v_0 to v_14); the remaining fields are listed below:

| Field | Meaning |
| --- | --- |
| SaleID | Transaction ID, unique code |
| name | Car trade name, desensitized |
| regDate | Date of car registration, e.g. 20160101 is January 01, 2016 |
| model | Model code, desensitized |
| brand | Car brand, desensitized |
| bodyType | Body type: limousine: 0, mini car: 1, van: 2, bus: 3, convertible: 4, coupe: 5, MPV: 6, mixer truck: 7 |
| fuelType | Fuel type: petrol: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
| gearbox | Transmission: manual: 0, automatic: 1 |
| power | Engine power, range [0, 600] |
| kilometer | Distance travelled by the car, in units of 10,000 km |
| notRepairedDamage | Car has unrepaired damage: yes: 0, no: 1 |
| regionCode | Area code, desensitized |
| seller | Seller: individual: 0, non-individual: 1 |
| offerType | Offer type: offer: 0, request: 1 |
| creatDate | Time the listing went online, i.e. when the sale started |
| price | Used-car transaction price (prediction target) |

First, take a rough look at the data: describe() gives a statistical overview of each column, and info() shows each column's type.
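For example:

```python
# Summary statistics of each numeric column, then dtypes and non-null counts.
print(data.describe())
data.info()
```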
All features except "notRepairedDamage" are numeric, so that feature will need to be converted to a numeric type later.

2. Feature correlation analysis:

Plot a heatmap of the feature correlations. The features with relatively high correlation with price are regDate and the anonymous features v_0, v_3, v_8, and v_12; these attributes need handling in feature engineering.
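A sketch of the correlation analysis, assuming the usual seaborn/matplotlib stack:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlations between the numeric features, visualised as a heatmap.
corr = train.select_dtypes('number').corr()
plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap='RdBu_r', center=0)
plt.show()

# The features most correlated with the target.
print(corr['price'].abs().sort_values(ascending=False).head(10))
```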

3. Feature value repetition analysis:

Count the number of unique values in each feature. For features whose values repeat heavily, missing values can later be filled with the mode.
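For example:

```python
# Unique-value counts per column; columns with few unique values repeat
# heavily and are candidates for mode imputation later.
print(data.nunique().sort_values())
```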

4. Statistical analysis of missing values:

Next, count the missing values and abnormal entries, and visualize the missing counts. The three fields bodyType, fuelType, and gearbox have a large number of missing values.
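A sketch of the count and plot:

```python
import matplotlib.pyplot as plt

# Missing-value counts per column, plus a quick bar chart of the worst ones.
# (price is missing for the test rows by construction.)
missing = data.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

missing[missing > 0].plot.bar()
plt.title('Missing values per feature')
plt.tight_layout()
plt.show()
```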

5. Price distribution analysis:

Plotting the histogram and density of price shows that it follows a long-tailed distribution, so the target needs to be normalized later; a log transform is the usual choice, and after the transform the distribution is much closer to normal.
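A sketch of both plots, using log1p as the normalizing transform:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# price is long-tailed; log1p pulls it toward a normal shape.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(train['price'], kde=True, ax=axes[0])
axes[0].set_title('price')
sns.histplot(np.log1p(train['price']), kde=True, ax=axes[1])
axes[1].set_title('log1p(price)')
plt.show()
```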

2. Data preprocessing

1. Missing value processing:

The repetition statistics above show that most features with missing values are highly repetitive, so the missing values are filled with the mode.
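A sketch of the mode filling; the column list follows the missing-value analysis above:

```python
# Fill missing values with each column's mode; these columns were shown
# above to be highly repetitive, so the mode is a reasonable default.
for col in ['bodyType', 'fuelType', 'gearbox']:
    data[col] = data[col].fillna(data[col].mode()[0])
```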

2. Converting the non-numeric feature

The only non-numeric feature is "notRepairedDamage", which takes three values: '-', '0.0', and '1.0'. Replace '-' with '0.0' and convert the column to float.
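For example:

```python
# Replace the placeholder '-' with '0.0', then cast the column to float.
data['notRepairedDamage'] = (
    data['notRepairedDamage'].replace('-', '0.0').astype(float)
)
```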

3. Outlier processing

The power feature should lie between 1 and 600, so values outside this range are truncated back into it. The anonymous features v_13 and v_14 are also concentrated in narrow ranges, so the few outliers beyond those ranges are truncated as well.
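A sketch; the clip thresholds for v_13 and v_14 are assumptions read off their plotted distributions:

```python
# Truncate power into its documented range; v_13 / v_14 cut-offs are
# assumptions based on where their distributions thin out.
data['power'] = data['power'].clip(1, 600)
data['v_13'] = data['v_13'].clip(upper=6)
data['v_14'] = data['v_14'].clip(upper=4)
```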

3. Feature engineering

1. Combining anonymous features:

Some anonymous features are highly correlated with price, indicating that they matter. New features are constructed by combining anonymous features with one another and with other features, for later use and analysis.
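An illustrative sketch; which combinations the post actually builds is not shown, so these are examples:

```python
# Example combinations: pair each anonymous variable with power and v_0.
for i in range(15):
    data[f'new_v_{i}_times_power'] = data[f'v_{i}'] * data['power']
for i in range(1, 15):
    data[f'new_v_0_plus_v_{i}'] = data['v_0'] + data[f'v_{i}']
```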

2. Extract date information

There are two date features, "regDate" and "creatDate". A date-parsing function date_tran() splits each date into year, month, and day.
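A plausible reconstruction of date_tran():

```python
# Split an integer date like 20160101 into year / month / day columns.
# Some raw regDate values have month '00'; map it to '01' as a fallback.
def date_tran(df, col):
    s = df[col].astype(str).str.zfill(8)
    df[col + '_year'] = s.str[:4].astype(int)
    df[col + '_month'] = s.str[4:6].replace('00', '01').astype(int)
    df[col + '_day'] = s.str[6:8].astype(int)
    return df

data = date_tran(data, 'regDate')
data = date_tran(data, 'creatDate')
```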

3. Count encoding of features

Count how many times the values of selected features occur, and turn those counts into new features. The function count_coding() performs this count encoding.
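A plausible reconstruction of count_coding(); the column list is an assumption:

```python
# Map each value of the listed features to its frequency in the dataset.
def count_coding(df, cols):
    for col in cols:
        df[col + '_count'] = df[col].map(df[col].value_counts())
    return df

data = count_coding(data, ['regionCode', 'brand', 'model',
                           'kilometer', 'bodyType', 'fuelType'])
```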

4. Constructing features from the dates

Based on the registration date and creation date, construct more meaningful features, such as the age of the car, the number of days between the registration date and the present, and the number of days between the creation date and the present. Since the resulting day counts take many distinct values, they are grouped into buckets with the function cut_group().
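A sketch of the date-derived features and cut_group(); the bin count is an assumption:

```python
import pandas as pd

# Days between registration and listing, a proxy for the car's age.
# errors='coerce' turns the invalid '00'-month dates into NaT / NaN.
reg = pd.to_datetime(data['regDate'].astype(str).str.zfill(8),
                     format='%Y%m%d', errors='coerce')
creat = pd.to_datetime(data['creatDate'].astype(str).str.zfill(8),
                       format='%Y%m%d', errors='coerce')
data['used_days'] = (creat - reg).dt.days

# Bucket a continuous day count into a fixed number of bins.
def cut_group(df, col, num_bins=50):
    df[col + '_bin'] = pd.cut(df[col], num_bins, labels=False)
    return df

data = cut_group(data, 'used_days')
```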

5. Feature crossing

Use numeric features to statistically describe the categorical features. The anonymous features most correlated with price are selected, and the function cross_cat_num() performs the first-order crossing.
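A plausible reconstruction of cross_cat_num(); the exact statistics and column lists are assumptions:

```python
# Describe each categorical feature with per-category statistics of the
# numeric features most correlated with price.
def cross_cat_num(df, cat_cols, num_cols):
    for cat in cat_cols:
        for num in num_cols:
            grp = df.groupby(cat)[num].agg(['max', 'min', 'median'])
            grp.columns = [f'{cat}_{num}_{s}' for s in grp.columns]
            df = df.merge(grp, left_on=cat, right_index=True, how='left')
    return df

data = cross_cat_num(data, ['model', 'brand', 'regionCode'],
                     ['v_0', 'v_3', 'v_8', 'v_12'])
```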

6. Feature encoding

Categorical features such as the car name, car brand, and region code are high-cardinality qualitative features and need to be encoded. The common choices, one-hot encoding and label encoding, are too simple to produce good results here and tend to cost a lot of memory and training time, so two methods better suited to high-cardinality qualitative features are used instead: mean encoding and target encoding.
① Mean encoding
Define the mean-encoding class, declare the features to encode, fit it on the training-set features and labels, and finally encode the test set.
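The original mean-encoder class is not shown here; below is a simplified out-of-fold sketch of the same idea (the column list and fold count are assumptions, and the full class also applies smoothing):

```python
import numpy as np
from sklearn.model_selection import KFold

# Split the merged frame back into train (price known) and test (price NaN).
train = data[data['price'].notna()].reset_index(drop=True)
test = data[data['price'].isna()].reset_index(drop=True)

def mean_encode(train_df, test_df, cols, target, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    global_mean = train_df[target].mean()
    for col in cols:
        oof = np.zeros(len(train_df))
        for fit_idx, val_idx in kf.split(train_df):
            # Per-category target means learned on the fitting fold only.
            means = train_df.iloc[fit_idx].groupby(col)[target].mean()
            oof[val_idx] = (train_df.iloc[val_idx][col].map(means)
                            .fillna(global_mean).values)
        train_df[col + '_mean_enc'] = oof
        test_df[col + '_mean_enc'] = (test_df[col]
                                      .map(train_df.groupby(col)[target].mean())
                                      .fillna(global_mean))
    return train_df, test_df

train, test = mean_encode(train, test, ['name', 'brand', 'regionCode'], 'price')
```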
②Target encoding
First set a default value for each encoding; here three statistics are used: the maximum, minimum, and mean. Target encodings are then produced by combining these statistics with the categorical features.
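A hedged sketch of this step, with the global max/min/mean acting as defaults (the encoded columns are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold

def target_encode(train_df, test_df, cols, target,
                  stats=('max', 'min', 'mean')):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for col in cols:
        for s in stats:
            name = f'{col}_target_{s}'
            default = train_df[target].agg(s)  # global statistic as default
            oof = np.full(len(train_df), default, dtype=float)
            for fit_idx, val_idx in kf.split(train_df):
                agg = train_df.iloc[fit_idx].groupby(col)[target].agg(s)
                oof[val_idx] = (train_df.iloc[val_idx][col].map(agg)
                                .fillna(default).values)
            train_df[name] = oof
            test_df[name] = (test_df[col]
                             .map(train_df.groupby(col)[target].agg(s))
                             .fillna(default))
    return train_df, test_df

train, test = target_encode(train, test, ['model', 'brand'], 'price')
```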
③Feature normalization
Merge the training set and test set, and use MinMaxScaler to normalize the data (see the sketch after step ④).
④ Feature Dimensionality Reduction
Use PCA to reduce the dimensionality of the features, and finally assign the processed data back to the training set and test set.
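A combined sketch of steps ③ and ④; the number of retained PCA components is an assumption:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Re-merge, scale every feature column to [0, 1], reduce dimensionality
# with PCA, then split back (train rows come first in the concat).
data = pd.concat([train, test], ignore_index=True)
feature_cols = [c for c in data.columns if c not in ('SaleID', 'price')]
data[feature_cols] = data[feature_cols].fillna(0)  # guard residual NaNs

scaled = MinMaxScaler().fit_transform(data[feature_cols])
reduced = PCA(n_components=60).fit_transform(scaled)

n_train = len(train)
X_train, X_test = reduced[:n_train], reduced[n_train:]
y_train = train['price'].values
```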

4. Prediction Model

1. Problem analysis

This problem, predicting the transaction price of used cars, is a typical supervised regression task, which can be handled with regression models, ensemble methods, and similar algorithms.

2. Available models

For this problem, models such as LightGBM, CatBoost, and neural networks can be used for training and prediction. LightGBM and CatBoost are both gradient-boosted tree models; their advantage is fast training convergence, with regularization coefficients and learning rates that can be tuned during training, while their disadvantage is high memory consumption and long training times. In this experiment I used a neural network: it can be trained on modest amounts of data, converges quickly, and reaches accuracy close to the tree models. In addition, optimizers such as Adam reduce memory consumption, improve computational efficiency, and handle the features well.

3. Training model and prediction

① Define the neural network model constructor; here I use a six-layer network structure.
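A sketch of such a constructor, assuming a Keras setup; the exact layer sizes of the six-layer version are not shown in the post, so these are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_dim):
    # Six dense layers; MAE matches the competition's evaluation metric.
    model = tf.keras.Sequential([
        layers.Dense(512, activation='relu', input_shape=(input_dim,)),
        layers.Dense(300, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(8, activation='relu'),
        layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss='mean_absolute_error')
    return model
```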

② Define the learning-rate adjustment function: every 100 epochs, the learning rate is reduced to 1/10 of its previous value.
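For example, with a Keras LearningRateScheduler callback:

```python
import tensorflow as tf

# Cut the learning rate to 1/10 of its current value every 100 epochs.
def adjust_lr(epoch, lr):
    if epoch > 0 and epoch % 100 == 0:
        return lr * 0.1
    return lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(adjust_lr)
```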
③ Start training and prediction. Regularization is applied during training to prevent overfitting; the training error is monitored and the learning rate is lowered as training progresses. Six-fold cross-validation is used to further reduce overfitting, Adam is used to optimize the gradient descent, and the predictions with the smallest loss are saved as the submission file.
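A simplified sketch of the training loop, reusing build_model() and lr_callback from above; the post keeps the submission with the smallest loss, while this sketch averages the fold predictions instead:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Six-fold cross-validation; batch size and epochs follow the tuned
# values reported later in this post.
kf = KFold(n_splits=6, shuffle=True, random_state=0)
test_pred = np.zeros(len(X_test))

for fold, (tr_idx, va_idx) in enumerate(kf.split(X_train)):
    model = build_model(X_train.shape[1])
    model.fit(X_train[tr_idx], y_train[tr_idx],
              validation_data=(X_train[va_idx], y_train[va_idx]),
              batch_size=2000, epochs=145,
              callbacks=[lr_callback], verbose=0)
    test_pred += model.predict(X_test).ravel() / kf.n_splits

# Save predictions in the submission format.
submission = pd.DataFrame({'SaleID': test['SaleID'], 'price': test_pred})
submission.to_csv('submission.csv', index=False)
```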

5. Adjust parameters and test

The model has parameters such as the number of epochs, batch size, learning rate, and number of network layers. To obtain better results, I consulted references and tuned the parameters many times. Since the number of submissions is limited, I submitted only the few configurations with the smallest local loss for tuning and scoring.


6. Final submission results:

After a series of adjustments, the best leaderboard score came from a batch size of 2000, 145 epochs, and a five-layer network with layer sizes 300, 64, 32, 8, and 1. I submitted the final predictions to the competition website; my final score was 400.3377, ranking 119th.


Experimental takeaways

In this experiment, I participated in the Used Car Price Prediction competition hosted by Alibaba Tianchi. By consulting materials and books, I learned to independently perform data analysis and data cleaning and to choose an appropriate model for prediction, which gave me a more comprehensive understanding of data mining. I mastered several common data-analysis and data-cleaning methods, and learned, understood, and practiced commonly used prediction models.
In addition, I started with no good intuition for the parameters used here (learning rate, batch size, epochs, etc.), so I tuned them manually and submitted the better results several times to climb the leaderboard, finally achieving a good result.
