Tianchi Competition: Industrial Steam Volume Prediction

Contents

Preface

1. Introduction to the competition problem

2. Data exploration

1. Read the data and view its distribution

2. Data correlation

3. Q-Q plots and the Box-Cox transformation

3. Feature processing

1. CatBoost and LightGBM feature processing

a. Feature crossing

b. Mean encoding

2. Linear feature processing

4. Model building

1. catboost + lightgbm + 5-fold KFold

2. linear + RandomForest

5. Model fusion

1. catboost + lightgbm simple weighted fusion

2. Model weight screening

3. Model weighted fusion with new weights

6. Online scores and rankings


Preface

The main takeaway from this competition is model fusion. Because the training and test sets have inconsistent distributions, the online MSE easily comes out higher (worse) than the offline validation MSE. In the end, after model weight screening, a weighted blend of four models (catboost, lightgbm, linear regression and random forest) fixed the score at 0.1146.


1. Introduction to the competition problem

The basic principle of thermal power generation is that burning fuel heats water into steam, the steam pressure drives a turbine, and the turbine drives a generator that produces electrical energy. In this chain of energy conversions, the core factor affecting generation efficiency is the boiler's combustion efficiency, that is, how effectively fuel is burned to heat water into high-temperature, high-pressure steam. Many factors influence boiler combustion efficiency, including adjustable boiler parameters such as fuel feed, primary and secondary air, induced draft, return air and feed-water volume, as well as boiler operating conditions such as bed temperature, bed pressure, furnace temperature and pressure, and superheater temperature.

"V0"-"V37", these 38 fields are used as feature variables, and "target" is used as the target variable. All variables are numerical variables and there are no missing values.

Submissions are ranked by the MSE (mean squared error) of the predictions.
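As a quick reference, MSE is simply the mean of the squared residuals. A small illustrative snippet (the values here are made up, not competition data):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([0.5, -1.2, 0.3])   # made-up target values
y_pred = np.array([0.4, -1.0, 0.5])   # made-up predictions
print(mean_squared_error(y_true, y_pred))   # scikit-learn implementation
print(np.mean((y_true - y_pred) ** 2))      # equivalent NumPy computation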

2. Data exploration

1. Read the data and view its distribution

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
with open("/zhengqi_train.txt")  as fr:
    df=pd.read_table(fr,sep="\t")
with open("/zhengqi_test.txt")  as fr:
    test=pd.read_table(fr,sep="\t")
# collect the numeric feature columns (taken from the test set, which has no 'target' column)
Nu_feature = list(test.select_dtypes(exclude=['object']).columns)
# plot the train/test distribution of each feature
plt.figure(figsize=(30,25))
i=1
for col in Nu_feature:
    ax=plt.subplot(7,6,i)
    ax=sns.kdeplot(df[col],color='red')
    ax=sns.kdeplot(test[col],color='cyan')
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
    ax=ax.legend(['train','test'])
    i+=1
plt.show()

The plots show that the train and test distributions of some variables are inconsistent, which is close to real-world conditions. When linear regression is used for modelling later, the variables with inconsistent distributions need to be removed to avoid overfitting.
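The inconsistent features are identified visually from the KDE plots above. As a supplementary check (my own addition, not part of the original workflow), a two-sample Kolmogorov-Smirnov test can quantify how different each feature's train and test distributions are:

from scipy import stats

# per-feature two-sample KS test: a large statistic / tiny p-value suggests that
# the train and test distributions of that feature differ noticeably
shifted = []
for col in test.columns:
    res = stats.ks_2samp(df[col], test[col])
    if res.pvalue < 0.01:
        shifted.append((col, round(res.statistic, 3)))
print(shifted)   # candidate variables to drop before linear modelling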

2. Data correlation

correlation_matrix=df.corr()
plt.figure(figsize=(12,10))
sns.heatmap(correlation_matrix,vmax=0.9,linewidths=0.05,cmap="RdGy")

Some variables are strongly correlated with the target. Select a few of them and draw scatter plots against the target variable.

col=['V0', 'V1','V2', 'V3', 'V5', 'V11', 'V24','V37','target']
sns.pairplot(df[col],hue='target',kind='scatter')

Since the target variable is continuous rather than discrete, the plot is less intuitive, but a roughly linear relationship is still visible.

3. Q-Q plots and the Box-Cox transformation

A note here: since some features do not follow a normal distribution, a Box-Cox transformation can be applied to bring them closer to normality, which can help the modelling and improve model performance. However, I did not build the models on the transformed data; the difference in the data distribution after the transformation is only shown here for reference.

from sklearn.preprocessing import MinMaxScaler    
from scipy import stats
# min-max scaling to [0, 1]
cols_numeric=list(test.columns)
def scale_minmax(col):
    return (col-col.min())/(col.max()-col.min())
train_data_process=df[cols_numeric].apply(scale_minmax,axis=0)
test_data_process=test[cols_numeric].apply(scale_minmax,axis=0)
# Box-Cox transformation (the +1 keeps the input strictly positive)
for col in test.columns:                   
    train_data_process.loc[:,col], _ = stats.boxcox(train_data_process.loc[:,col]+1)
    test_data_process.loc[:,col], _ = stats.boxcox(test_data_process.loc[:,col]+1)
# distribution plots and Q-Q plots
plt.figure(figsize=(30,40))
j=1
for col in test.columns:
    ax=plt.subplot(14,6,j)
    sns.distplot(test_data_process[col],fit=stats.norm)
    ax.set_xlabel(col)
    j+=1
    ax=plt.subplot(14,6,j)
    stats.probplot(test_data_process[col],dist=stats.norm, plot=plt)
    j+=1
plt.subplots_adjust(wspace=0.3,hspace=0.5)  # adjust spacing between subplots
plt.show()

Q-Q plots of the training data before the transformation:

Q-Q plots of the training data after the transformation:

The comparison shows that the Box-Cox-transformed data follows the normal distribution more closely.

3. Feature processing

Feature processing is mainly aimed at catboost and lightgbm. The random forest is modelled directly on the original data, and the linear regression is modelled on the original data with some variables removed.

For the ensemble (tree) models, build as many features as possible to improve model accuracy.

For the linear regression model, some variables have inconsistent train/test distributions, and deleting them helps avoid overfitting.

1. CatBoost and LightGBM feature processing

a. Feature crossing

num_cols = [0,1,2,3,10,12,8]
for index, value in enumerate(num_cols):
    for j in num_cols[index+1:]:
        df['new'+str(value)+'+'+str(j)]=df['V'+str(value)]+df['V'+str(j)]
        test['new'+str(value)+'+'+str(j)]=test['V'+str(value)]+test['V'+str(j)]

num_cols = [0,1,2,3,16,31]
for index, value in enumerate(num_cols):
    for j in num_cols[index+1:]:
        df['new'+str(value)+'+'+str(j)]=df['V'+str(value)]+df['V'+str(j)]
        test['new'+str(value)+'+'+str(j)]=test['V'+str(value)]+test['V'+str(j)]

b. Mean encoding

# separate features and target
X=df.drop(columns='target')
Y=df['target']
# mean (target) encoding
import Meancoder   # custom helper module implementing mean encoding
class_list = ['V0','V1','V2','V3']
MeanEnocodeFeature = class_list
ME = Meancoder.MeanEncoder(MeanEnocodeFeature,target_type='regression') # instantiate the mean-encoding class
X = ME.fit_transform(X,Y)   # fit on the training features and target
test = ME.transform(test)   # encode the test set
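Meancoder here is a custom helper module that is not included in the post. For reference, the sketch below shows the general idea of out-of-fold mean (target) encoding; the quantile binning of the continuous features and all parameter values are my own assumptions, so the real Meancoder.MeanEncoder may behave differently (for example, it may add smoothing):

import pandas as pd
from sklearn.model_selection import KFold

def mean_encode(train_col, y, test_col, n_bins=20, n_splits=5, seed=2022):
    # discretise the continuous feature into quantile bins so that group means are meaningful
    bins, edges = pd.qcut(train_col, q=n_bins, retbins=True, labels=False, duplicates='drop')
    prior = y.mean()
    oof = pd.Series(prior, index=train_col.index, dtype=float)
    # out-of-fold encoding: each fold is encoded with target means computed on the other
    # folds, which limits target leakage into the training features
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train_col):
        fold_means = y.iloc[fit_idx].groupby(bins.iloc[fit_idx]).mean()
        oof.iloc[enc_idx] = bins.iloc[enc_idx].map(fold_means).fillna(prior).to_numpy()
    # the test set is encoded with target means computed on the full training set
    test_bins = pd.cut(test_col.clip(edges[0], edges[-1]), bins=edges,
                       labels=False, include_lowest=True)
    test_encoded = test_bins.map(y.groupby(bins).mean()).fillna(prior)
    return oof, test_encoded

# illustrative usage for one of the encoded columns:
# X['V0_mean_enc'], test['V0_mean_enc'] = mean_encode(X['V0'], Y, test['V0'])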

2. Linear feature processing

# drop the variables whose train/test distributions are inconsistent
df.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27'],axis=1,inplace=True)  
test.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27'],axis=1,inplace=True)

Note: the different models use different feature sets, so each is run separately to obtain its own predictions.

4. Model building

1. catboost + lightgbm + 5-fold KFold

from catboost import CatBoostRegressor
from lightgbm.sklearn import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import  mean_absolute_error, mean_squared_error, r2_score 
# split the data into training and validation sets
x_train,x_test,y_train,y_test = train_test_split( X, Y,test_size=0.3,random_state=1)
# define the two models
clf=CatBoostRegressor(loss_function="MAE",
                      eval_metric= 'R2',
                      task_type="CPU",
                      od_type="Iter",   # overfitting detector type
                      depth=7,              
                      learning_rate=0.02,  
                      iterations=5000,     
                      random_seed=2022) 
gbm = LGBMRegressor(n_estimators=5000,learning_rate=0.02,boosting_type= 'gbdt',
    objective = 'regression_l1',
    max_depth = -1,  
    random_state=2022,
    metric='mse')
# 5-fold training
result = []
mean_score = 0
result1 = []
mean_score1 = 0
n_folds=5
kf = KFold(n_splits=n_folds ,shuffle=True,random_state=2022) 
for train_index, test_index in kf.split(X):
    x_train = X.iloc[train_index]
    y_train = Y.iloc[train_index]
    x_test = X.iloc[test_index]
    y_test = Y.iloc[test_index]
    # CatBoost training
    clf.fit(x_train,y_train,verbose=5000)
    y_pred=clf.predict(x_test) 
    print('Validation MSE: {}'.format(mean_squared_error(y_test,y_pred)))
    mean_score += mean_squared_error(y_test,y_pred)/ n_folds
    y_pred_test = clf.predict(test)
    result.append(y_pred_test)
    # LightGBM training
    gbm.fit(x_train,y_train)
    y_pred1=gbm.predict(x_test)
    print('Validation MSE: {}'.format(mean_squared_error(y_test,y_pred1)))
    mean_score1 += mean_squared_error(y_test,y_pred1)/ n_folds
    y_pred_final1 = gbm.predict((test),num_iteration=gbm.best_iteration_)
    y_pred_test1=y_pred_final1
    result1.append(y_pred_test1)
# model evaluation
print('Mean validation MSE (CatBoost): {}'.format(mean_score))
cat_pre=sum(result)/n_folds
np.savetxt('/test.txt',cat_pre)

print('Mean validation MSE (LightGBM): {}'.format(mean_score1))
cat_pre1=sum(result1)/n_folds
np.savetxt('/test_gbm.txt',cat_pre1)

2. linear + RandomForest

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
lr=LinearRegression()
rf=RandomForestRegressor(n_estimators=800,max_features='sqrt',random_state=0)
lr.fit(x_train,y_train)
rf.fit(x_train,y_train)
linear_predict = lr.predict(x_test)
rf_predict = rf.predict(x_test)
# model evaluation
print('LR_MSE:',mean_squared_error(y_test,linear_predict))
print('rf_MSE:',mean_squared_error(y_test,rf_predict))
# predict on the test set
pre_lr = lr.predict(test)
pre_rf = rf.predict(test)
# save the predictions
np.savetxt('/pre_lr.txt',pre_lr)
np.savetxt('/pre_rf.txt',pre_rf)

Note: the linear model needs the inconsistent variables deleted and must be retrained separately, which distinguishes it from catboost and lightgbm.

The random forest is trained on the original data, with no features added and no variables deleted.

5. Model fusion

The key to model fusion is setting the weights; combining different models with different weights can produce good results. The online scores of the combinations tried are:

catboost + lightgbm + 5-fold KFold: 0.1253
catboost + lightgbm + Linear + RandomForest: 0.1181
catboost + lightgbm + 5-fold KFold + Linear + RandomForest: 0.1146

1. catboost + lightgbm simple weighted fusion

# weighted fusion of the two models: each weight is inversely proportional to that model's mean CV MSE
sub_Weighted = (1-mean_score1/(mean_score1+mean_score))*cat_pre1+(1-mean_score/(mean_score1+mean_score))*cat_pre

This two-model blend alone scores 0.1253 online.

2. Model weight screening

def model_mix(pred, pred1, pred2, pred3):
    # brute-force grid search over integer weights for the four models
    rows = []
    for a in range(10):
        for b in range(10):
            for c in range(10):
                for d in range(1, 10):
                    y_pred3 = (a*pred + b*pred1 + c*pred2 + d*pred3) / (a+b+c+d)
                    mse = mean_squared_error(y_test, y_pred3)
                    rows.append({'CatBoostRegressor': a,
                                 'LGBMRegressor': b,
                                 'Linear': c,
                                 'RandomForest': d,
                                 'Combine': mse})
    # build the DataFrame in one go instead of appending row by row
    return pd.DataFrame(rows, columns=['CatBoostRegressor','LGBMRegressor','Linear','RandomForest','Combine'])

model_combine = model_mix(y_pred,y_pred1,linear_predict,rf_predict)

model_combine.sort_values(by='Combine', inplace=True)  
model_combine.head()

Note: the catboost and lightgbm predictions used here must come from the same train/validation split as the linear and RandomForest ones, i.e. from a single training run, so that y_test is identical for all four models.
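To make this concrete, a minimal sketch of generating the four validation predictions from one shared row split before calling model_mix might look like the following. The names X_tree, X_lin and X_raw are hypothetical placeholders for the three feature matrices built earlier (crossed/mean-encoded features for the boosted trees, the reduced feature set for the linear model, and the untouched original features for the random forest):

import numpy as np
from sklearn.model_selection import train_test_split

# one shared split of the row indices, so y_test is identical for every model
idx_train, idx_test = train_test_split(np.arange(len(Y)), test_size=0.3, random_state=1)
y_train, y_test = Y.iloc[idx_train], Y.iloc[idx_test]

clf.fit(X_tree.iloc[idx_train], y_train)    # CatBoost on the engineered features
gbm.fit(X_tree.iloc[idx_train], y_train)    # LightGBM on the engineered features
lr.fit(X_lin.iloc[idx_train], y_train)      # linear regression on the reduced features
rf.fit(X_raw.iloc[idx_train], y_train)      # random forest on the original features

y_pred = clf.predict(X_tree.iloc[idx_test])
y_pred1 = gbm.predict(X_tree.iloc[idx_test])
linear_predict = lr.predict(X_lin.iloc[idx_test])
rf_predict = rf.predict(X_raw.iloc[idx_test])

model_combine = model_mix(y_pred, y_pred1, linear_predict, rf_predict)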

The weight screening on the validation set gave 9, 9, 4, 1. After further testing, the final weight coefficients were set to Linear: 9, RandomForest: 9, CatBoostRegressor: 9, LGBMRegressor: 1, which scored 0.1181 online.

Weight screening on the validation set provides a useful reference for choosing the blend weights and is a good screening method.

3. Model weighted fusion with new weights

with open("/pre_lr.txt")  as fr:
    df_lr=pd.read_table(fr,header=None,sep="\t")
with open("/pre_rf.txt")  as fr:
    df_rf=pd.read_table(fr,header=None,sep="\t")
with open("/test.txt")  as fr:
    df_test=pd.read_table(fr,header=None,sep="\t")
with open("/test_gbm1.txt")  as fr:
    df_test_gbm=pd.read_table(fr,header=None,sep="\t")
# 加权计算
mix_predict = (9*df_lr + 9*df_rf+9*df_test+1*df_test_gbm) /28
np.savetxt('/four_mix_4predict.txt',mix_predict)

6. Online scores and rankings


Summary

1. For ensemble (tree) models, keep the data as complete as possible. For linear models, features with inconsistent train/test distributions need to be removed, otherwise overfitting is likely.

2. The effect of feature engineering also varies with the model and with the random seed used to split the data.

3. The Box-Cox transformation was not used in this solution and deserves further study.

4. Model fusion can clearly improve the online score, and there are many ways to apply it.


Source: blog.csdn.net/weixin_46685991/article/details/127448485