DataWhale AI Summer Camp - Machine Learning

Learning Record 1

Lithium battery production parameter regulation and production temperature prediction challenge

The environment has been configured, the baseline has been run through, and a simple analysis of the data has been carried out on this basis.

1. Outlier Analysis

Analyze missing values and outliers in the training set:

train_data.info()  
train_data.describe()

Inspection shows the data has no missing values, but there is an outlier row where train_dataset['下部温度9'] == -32768.000000. Remove this outlier:

train_dataset = train_dataset.drop(train_dataset[train_dataset['下部温度9'] == -32768.000000].index).reset_index(drop=True)

2. Univariate boxplot visualization

The distributions of the flow rate, the upper set temperature, and the lower set temperature in the training and test sets were visualized as boxplots. Some outliers were observed in the upper and lower set temperature data, corresponding to the three days 2023/1/7, 2023/1/8, and 2023/1/9.
[Figures: boxplots of the flow rate and of the upper/lower set temperatures in the training and test sets]
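A minimal sketch of how these boxplots could be drawn, assuming the DataFrames are named train_data and test_data and that the columns share the prefixes '流量', '上部温度设定', and '下部温度设定' (all assumptions; adjust to the actual names):

import matplotlib.pyplot as plt

# Assumption: columns are grouped by these prefixes.
for prefix in ['流量', '上部温度设定', '下部温度设定']:
    cols = [c for c in train_data.columns if c.startswith(prefix)]
    fig, axes = plt.subplots(1, 2, figsize=(14, 4), sharey=True)
    train_data[cols].boxplot(ax=axes[0], rot=90)
    test_data[cols].boxplot(ax=axes[1], rot=90)
    axes[0].set_title(f'train: {prefix}')
    axes[1].set_title(f'test: {prefix}')
    plt.tight_layout()
    plt.show()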


Two changes were made on top of the baseline (6.29551):

  1. Removed the incorrect value from the data (6.72811)
  2. Dropped the data for 2023/1/7, 2023/1/8, and 2023/1/9 (6.73844); a sketch of this date filtering is shown below
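A hedged sketch of the second change, assuming the timestamp column of train_dataset is named '时间' (adjust to the real column name):

import pandas as pd

# Assumption: '时间' holds the timestamps of the training rows.
times = pd.to_datetime(train_dataset['时间'])
drop_days = {pd.Timestamp('2023-01-07').date(),
             pd.Timestamp('2023-01-08').date(),
             pd.Timestamp('2023-01-09').date()}
train_dataset = train_dataset[~times.dt.date.isin(drop_days)].reset_index(drop=True)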



3. Feature Importance Analysis

Next, an attempt was made at a basic feature analysis for a single target variable:

  1. Calculate the correlation matrix
df = pd.concat([X, y.iloc[:, 0]], axis=1)  # X is the processed training data, y contains the labels
corr_matrix = df.corr()
corr_matrix['上部温度1'].sort_values(ascending=False)
  2. Feature importance from LightGBM
feature_importance = model.feature_importance()
sorted_features = sorted(zip(X.columns, feature_importance), key=lambda x: x[1], reverse=True)

# Print the features in descending order of feature_importance
for feature, importance in sorted_features:
    print(f"{feature}: {importance}")

The correlation matrix only captures linear correlation, so its results differ somewhat from LightGBM's feature importance.

The next step is to construct different features and perform feature selection separately for each prediction target.

Learning Record 2 (Updated on 2023.07.27)

The past few days were mainly spent wrestling with the baseline, and every attempt failed to beat it. Attempts were made at the data level, in feature engineering, in the data splitting method, and in post-processing.

1. Data level

Missing values (recorded as 0) and outliers were found in the data last time; they were located by searching and then corrected or filtered out. More recently, visualization showed that almost every feature contains outliers, especially the flow-rate features.

[Figure: line chart of the 17 flow-rate features in the training data]
The figure above is a line chart of the 17 flow-rate features in the training data; considerable fluctuation is still visible.
[Figure: flow-rate features after sliding-window median filtering]
The figure above shows the result of sliding-window median filtering, where the red curve is the filtered test set.
Code used for filtering:

def smooth_t(df, cols):
    df = df.copy()

    window_size = 5
    for col in cols:
        # Sliding-window median of each column
        df[f'smoothed_{col}'] = df[col].rolling(window=window_size, center=True).median()
        # Replace points that deviate from the median by more than the threshold
        outlier_threshold = 5.0
        df['absolute_diff'] = abs(df[col] - df[f'smoothed_{col}'])
        outliers = df['absolute_diff'] > outlier_threshold
        df.loc[outliers, col] = df.loc[outliers, f'smoothed_{col}']
        df.drop(columns=[f'smoothed_{col}', 'absolute_diff'], inplace=True)
    return df

Result: filtering the data did not reduce the MAE on the test set.

2. Feature Engineering

Mainly from three aspects:

  1. Flow-rate features: the variance, mean, and coefficient of variation over a fixed time range (one day) were constructed. Of these, the coefficient of variation performed well, scoring higher than the original flow-rate features in the tree model's feature importance, but training with it did not improve performance on the test set, and it is still somewhat unclear which features to keep and which to drop (a sketch of these daily statistics is given below, after this list).
  2. Set-temperature features: according to corr these features have a strong linear correlation with the target, but they score poorly in the tree model's feature-importance evaluation, and their values are quite stable, so it is not easy to construct derived features from them. I tried turning them into discrete features, replacing all values with the n most frequent values, and encoding the change between consecutive points as 1, 0, or -1. These features also performed poorly.
  3. I then tried the fully automatic feature generator OpenFE. Its effect was also mediocre.
from openfe import OpenFE, transform, tree_to_formula

ofe = OpenFE()
features = ofe.fit(data=train_x, label=train_y, n_jobs=12)  # n_jobs: number of CPU cores
train_x_feature, test_dataset_feature = transform(train_x, test_x, features[:20], n_jobs=12)

# Show the top 20 highest-scoring generated features
for feature in ofe.new_features_list[:20]:
    print(tree_to_formula(feature))
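Returning to point 1 above, here is a minimal sketch of the daily flow-rate statistics (mean, variance, coefficient of variation), assuming the flow-rate column names are collected in flow_cols and the timestamp column is named '时间' (both assumptions):

import pandas as pd

def add_daily_flow_stats(df, flow_cols, time_col='时间'):
    # Group by calendar day and attach per-day mean, variance and coefficient of variation
    df = df.copy()
    day = pd.to_datetime(df[time_col]).dt.date
    for col in flow_cols:
        grouped = df.groupby(day)[col]
        df[f'{col}_day_mean'] = grouped.transform('mean')
        df[f'{col}_day_var'] = grouped.transform('var')
        df[f'{col}_day_cv'] = grouped.transform('std') / df[f'{col}_day_mean']  # CV = std / mean
    return df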

3. Data division method

Because this is time-series data, using train_test_split directly introduces time leakage ("time travel"). However, splitting with TimeSeriesSplit actually gave very poor results, performing worse than KFold and train_test_split.
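As a rough sketch of this comparison (not the baseline's exact setup; X and y are assumed to be the processed training features and one target column):

from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
import lightgbm as lgb

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)

# Expanding-window split: each validation fold lies strictly after its training data
tscv = TimeSeriesSplit(n_splits=5)
ts_mae = -cross_val_score(model, X, y, cv=tscv, scoring='neg_mean_absolute_error').mean()

# Shuffled KFold: ignores time order (risks leakage) but scored better here
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_mae = -cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_error').mean()

print(f'TimeSeriesSplit MAE: {ts_mae:.4f}, KFold MAE: {kf_mae:.4f}')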

4. Post-processing

  1. Hillclimbing library - for model fusion
    I found a library for model fusion, but because only lightgbm is currently used, I haven't tried it yet.
!pip install hillclimbers

from hillclimbers import climb, partial

def climb_hill(
    train=None, 
    oof_pred_df=None, 
    test_pred_df=None, 
    target=None, 
    objective=None, 
    eval_metric=None,
    negative_weights=False, 
    precision=0.01, 
    plot_hill=True, 
    plot_hist=False
) -> np.ndarray: # Returns test predictions resulting from hill climbing
  2. Post-processing technique
    I found an interesting post-processing trick. It does not apply to this dataset, but I used the idea to build the discrete encoding of the set-temperature features, although the effect was also very poor.
# 1. Store the unique target values
unique_targets = np.unique(train['yield'])
# 2. Generate out-of-fold and test predictions
oof_preds = model.predict(X_val[features])
test_preds += model.predict(test[features]) / n_splits
# 3. Round each prediction to the nearest unique target value
oof_preds = [min(unique_targets, key=lambda x: abs(x - pred)) for pred in oof_preds]
test_preds = [min(unique_targets, key=lambda x: abs(x - pred)) for pred in test_preds]

Learning Record 3 (Updated on 2023.07.30)

After watching Yulao's live stream I gained a lot. I had spent a lot of time on feature selection, but later realized that the number of features I had constructed was far too small to call for feature selection. Also, when the score stops improving, it pays to go back to data analysis. So the data was analyzed carefully over the past two days, and some new discoveries were made.

First, the number of records in each hour of the training set was counted, from 2022-11-06 09:00:00 to 2023-03-01 04:00:00. A sample of the counts looks like this:

2022-11-06 09:00:00,40
2022-11-06 10:00:00,47
2022-11-06 11:00:00,46
2022-11-06 12:00:00,47
2022-11-06 13:00:00,47
2022-11-06 14:00:00,48
2022-11-06 15:00:00,47
......
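A minimal sketch of this hourly count, assuming the training DataFrame is train_data and its timestamp column is named '时间' (an assumption):

import pandas as pd

# Assumption: '时间' is the timestamp column.
times = pd.to_datetime(train_data['时间'])
records_per_hour = train_data.groupby(times.dt.floor('H')).size()
print(records_per_hour.head(10))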

The training set was divided into five parts according to the per-hour sampling frequency. For example, the data for 2022-11-07 and 2022-11-08 is sampled about 48 times per hour, so it was split into 4 interleaved parts by taking every 4th row, giving about 12 samples per hour in each part. Similar operations were applied to the rest of the training set, so that the whole training set ends up at roughly 12 samples per hour.
This was done because the test set consistently keeps a sampling frequency of 11 or 12 records per hour.

After this processing, the training set and the test set have equal time steps, which makes it convenient to construct time-series features later.
In addition, splitting the training set this way also avoids the impact of gaps in the recording time.
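A minimal sketch of the interleaved split for one roughly-48-samples-per-hour segment (the variable name seg and the 4-way stride are assumptions based on the description above):

# Assumption: seg is the slice of the training set sampled ~48 times per hour, sorted by time.
parts = [seg.iloc[offset::4].reset_index(drop=True) for offset in range(4)]
# Each of the 4 interleaved parts now has roughly 12 samples per hour,
# matching the test set's sampling frequency.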

In addition, the validation set was selected manually to keep its distribution consistent with that of the test set, so that offline and online scores match better.

The time-series features from the previous attempts were then rebuilt on the processed data with equal time steps and checked with the baseline method as a simple verification. This time I switched to a faster machine, and the final MAE of 7.52 was even worse. I then re-ran the baseline and found that it now scored 8.51, while the previous baseline score was still 6.29.


Summary: these two days of data processing took too much time, and it is hard to judge whether the approach is good; the results are affected by many factors and it is not easy to control variables. The whole programming process was also chaotic and needs more practice later. Some of the ideas from Yulao's live stream have not been tried yet, so I will try them in the remaining days.


Origin blog.csdn.net/qq_38869560/article/details/131873308