Tianchi AI Competition: Intelligent Manufacturing Prediction Problem

1. Brief introduction to the competition task and requirements (many features, few samples)

This competition provides production data from different processes on a production line (the specific meaning of the data is unknown). From these data, features are constructed and models are designed to predict the corresponding production values. The evaluation metric is MSE: the squared difference between the predicted value and the true value is computed for each sample, the squared differences are summed over all samples, and the average is taken as the score.
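As a small illustration of the metric described above, MSE can be computed as below. The function name and the sample numbers are made up for this sketch and are not part of the competition code.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Made-up values, purely for illustration
score = mean_squared_error([2.0, 3.0, 4.0], [2.5, 3.0, 3.0])
print(score)
```

Because the errors are squared, a few badly predicted samples can dominate the score, which matters when the sample count is small.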

2. Data description

The data consist of an ID column, tool columns, and data columns, each with its own naming format. The data are divided into thirteen groups, mainly according to the tool columns: the data columns between two consecutive tool columns form one process. Because the data columns and the target column Y are desensitized, their specific meaning is unknown. The data columns are not arranged in any obvious chronological order, so their order within a process is effectively random. Some columns have many missing values, many columns contain only a single value, and many columns are exact duplicates of others.

Within each column, the distribution of the data is clearly influenced by the tool. In some data columns, missing values have been filled with zeros or other outlier values.

3. Data preprocessing

1. Process splitting. The entire data set is divided into process steps according to the tool columns. Based on inspection of the values, the chamber id column is merged with the tool column, and the combination of operation_id and chamber is used as a tool column for the split.
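The splitting step might be sketched as follows with pandas. The column names (`TOOL_A`, `a1`, and so on) and the helper `split_into_processes` are hypothetical, since the real naming format is not given here; the sketch only shows the idea of grouping data columns under the tool column that precedes them.

```python
import pandas as pd

def split_into_processes(df, tool_cols):
    """Map each tool column to the data columns that follow it,
    up to (but not including) the next tool column."""
    processes = {}
    current = None
    for col in df.columns:
        if col in tool_cols:
            current = col
            processes[current] = []
        elif current is not None:
            # Columns before the first tool column (e.g. the ID) are skipped
            processes[current].append(col)
    return processes

# Hypothetical toy frame; real column names differ
df = pd.DataFrame({
    "ID": [1, 2],
    "TOOL_A": ["M", "N"],
    "a1": [0.1, 0.2],
    "a2": [1.0, 2.0],
    "TOOL_B": ["P", "Q"],
    "b1": [5, 6],
})
processes = split_into_processes(df, tool_cols={"TOOL_A", "TOOL_B"})
print(processes)
```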



2. Data cleaning

Delete single-valued columns, all-null columns, and duplicate columns.
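One possible way to do this cleaning with pandas is sketched below. The helper name and the toy data are assumptions for illustration; note that this version also drops a column that holds one distinct value plus missing entries.

```python
import pandas as pd

def drop_uninformative_columns(df):
    """Remove all-null, single-valued, and duplicated columns."""
    # Columns in which every entry is missing
    df = df.dropna(axis=1, how="all")
    # Columns with at most one distinct non-missing value
    df = df.loc[:, df.nunique(dropna=True) > 1]
    # Columns identical to an earlier column (compared via the transpose)
    df = df.loc[:, ~df.T.duplicated()]
    return df

# Toy data, purely for illustration
raw = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [7.0, 7.0, 7.0],                              # single value
    "c": [float("nan"), float("nan"), float("nan")],   # all null
    "d": [1.0, 2.0, 3.0],                              # duplicate of "a"
    "e": [3.0, 1.0, 2.0],
})
cleaned = drop_uninformative_columns(raw)
print(list(cleaned.columns))
```

Transposing a very wide frame can be slow, so on the full data set a hash-based comparison of columns may be preferable.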

3. Convert 8-digit, 16-digit, and other date formats to seconds since 2016-01-01.
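A rough sketch of such a conversion is shown below. The exact layouts of the date columns are not stated here, so the code assumes that 8-digit values are `YYYYMMDD` and that longer values start with `YYYYMMDDHHMMSS`, with any trailing digits ignored. Both layouts and the function name are assumptions.

```python
from datetime import datetime

EPOCH = datetime(2016, 1, 1)

def to_seconds_since_2016(value):
    """Convert a numeric date stamp to seconds elapsed since 2016-01-01 00:00:00."""
    digits = str(int(value))
    if len(digits) == 8:
        # Assumed layout: YYYYMMDD
        stamp = datetime.strptime(digits, "%Y%m%d")
    elif len(digits) >= 14:
        # Assumed layout: YYYYMMDDHHMMSS, trailing digits are ignored
        stamp = datetime.strptime(digits[:14], "%Y%m%d%H%M%S")
    else:
        raise ValueError(f"unrecognized date format: {digits}")
    return (stamp - EPOCH).total_seconds()

print(to_seconds_since_2016(20160102))          # one day after the epoch
print(to_seconds_since_2016(2016010100010000))  # one minute after the epoch
```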

4. Fill blank values (0 and NA) with the mean of the non-blank values in the same column.
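The filling rule could look like this in pandas. The function name and toy data are illustrative only; zeros are first turned into NaN so that they do not bias the column mean.

```python
import numpy as np
import pandas as pd

def fill_blanks_with_mean(df):
    """Treat 0 and NaN as blanks and replace them with the mean
    of the remaining values in the same column."""
    masked = df.replace(0, np.nan)
    # mean() skips NaN, so only non-blank values contribute
    return masked.fillna(masked.mean())

# Toy data, purely for illustration
df = pd.DataFrame({"x": [1.0, 0.0, 3.0], "y": [np.nan, 4.0, 6.0]})
filled = fill_blanks_with_mean(df)
print(filled)
```

A column in which every value is blank has no mean and stays NaN in this sketch; such columns would already have been removed in the cleaning step.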

4. Feature construction

1. Compute single-factor and two-factor interaction columns and add them to the candidate features:

(1) For a single factor, X is the original value of the feature, Xerr is the difference between the original value and the column mean, and Xerrabs is the absolute value of Xerr.

(2) For two factors (denoted X and Y), build the feature vectors X+Y, XY, X/Y, and Y/X.
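The two kinds of features above can be sketched as follows. The helper names and output column names are hypothetical, "XY" is read here as the element-wise product (an assumption), and infinite ratios from division by zero are turned into NaN, which is a choice made for this sketch rather than something stated above.

```python
import numpy as np
import pandas as pd

def single_factor_features(df, col):
    """X, Xerr (deviation from the column mean), and Xerrabs (its absolute value)."""
    x = df[col]
    err = x - x.mean()
    return pd.DataFrame({
        col: x,
        f"{col}_err": err,
        f"{col}_errabs": err.abs(),
    })

def two_factor_features(df, a, b):
    """Sum, product, and both ratios of two columns."""
    x, y = df[a], df[b]
    feats = pd.DataFrame({
        f"{a}_plus_{b}": x + y,
        f"{a}_times_{b}": x * y,  # assumes "XY" means the product
        f"{a}_over_{b}": x / y,
        f"{b}_over_{a}": y / x,
    })
    # Division by zero yields inf; replace it with NaN
    return feats.replace([np.inf, -np.inf], np.nan)

# Toy data, purely for illustration
df = pd.DataFrame({"p": [1.0, 2.0, 3.0], "q": [2.0, 4.0, 8.0]})
single = single_factor_features(df, "p")
pair = two_factor_features(df, "p", "q")
print(single)
print(pair)
```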


2. Screening of candidate features

Since one column yields three features and a pair of columns yields fifteen two-factor features, there are many candidate features, so an initial screening is required. The Pearson correlation between each candidate column and the target value is computed, and the columns with high correlation are kept.
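A minimal sketch of this screening with pandas follows. The threshold of 0.3, the use of the absolute value of the correlation, and the function name are all assumptions; the cutoff actually used is not stated here.

```python
import pandas as pd

def screen_by_pearson(features, target, threshold=0.3):
    """Keep feature columns whose absolute Pearson correlation
    with the target is at least `threshold` (hypothetical cutoff)."""
    corr = features.corrwith(target)  # Pearson is the default method
    keep = corr[corr.abs() >= threshold].index
    return features[keep], corr

# Toy data, purely for illustration
features = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],  # perfectly correlated with the target
    "f2": [2.0, 1.0, 2.0, 1.0, 2.0],  # uncorrelated with the target
})
target = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])
selected, corr = screen_by_pearson(features, target)
print(list(selected.columns))
```

Pearson correlation only captures linear relationships, so this filter is best treated as a coarse first pass before model-based selection.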

3. Model selection. SVR, LASSO, GBDT, and model fusion were tried; xgboost was used in the end.
