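Contents
1. Introduction to the competition questions
2. Data exploration
   1. Read data and view data distribution
   2. Data correlation
   3. QQ plot and Box-Cox transformation
3. Feature processing
   1. catboost and lightgbm feature processing
   2. Linear feature processing
4. Build a model
5. Model fusion
   1. catboost+lightgbm simple weighted fusion
   2. Model weight screening
   3. Model weighted fusion with new weights
6. Online scores and rankings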
Preface
The main lesson from this competition is model fusion. Because the test set and the training set are distributed differently, the online score (MSE, lower is better) tends to come out higher than the offline validation score. After screening the model weights, a fusion of four models (catboost, lightgbm, linear regression and random forest) brought the final online score to 0.1146.
1. Introduction to the competition questions
The basic principle of thermal power generation is as follows: burning fuel heats water to generate steam, the steam pressure drives the turbine, and the turbine drives the generator to produce electrical energy. In this chain of energy conversions, the key factor for power generation efficiency is the combustion efficiency of the boiler, that is, how well the burning fuel heats water into high-temperature, high-pressure steam. Many factors affect the combustion efficiency of the boiler, including its adjustable parameters, such as combustion feed, primary and secondary air, induced air, return air and feed water volume, and its operating conditions, such as bed temperature, bed pressure, furnace temperature, furnace pressure and superheater temperature.
"V0"-"V37", these 38 fields are used as feature variables, and "target" is used as the target variable. All variables are numerical variables and there are no missing values.
The ranking results are based on the MSE (mean square error) of the predicted results.
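As a reminder of what the leaderboard measures, here is a minimal NumPy sketch of MSE. The function name `mse` and the sample values are made up for illustration and are not competition data.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative values only: residuals are 0, -0.5 and 1.
example = mse([0.0, 1.0, 2.0], [0.0, 1.5, 1.0])
print(example)
```

A lower value is better, which is why a higher online score than offline score signals a problem.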
2. Data exploration
1. Read data and view data distribution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Load the tab-separated training and test sets
with open("/zhengqi_train.txt") as fr:
    df = pd.read_table(fr, sep="\t")
with open("/zhengqi_test.txt") as fr:
    test = pd.read_table(fr, sep="\t")

# Get the numeric variables
Nu_feature = list(df.select_dtypes(exclude=['object']).columns)

# Plot the train and test distribution of each variable
plt.figure(figsize=(30, 25))
i = 1
for col in Nu_feature:
    ax = plt.subplot(7, 6, i)
    ax = sns.kdeplot(df[col], color='red')
    if col in test.columns:  # 'target' exists only in the training set
        ax = sns.kdeplot(test[col], color='cyan')
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
    ax = ax.legend(['train', 'test'])
    i += 1
plt.show()
The plots show that the train and test distributions of some variables are inconsistent, which is close to what happens in real data. When linear regression is used later, the variables with inconsistent distributions need to be removed to avoid overfitting.
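The comparison above is done by eye. As a complement, the mismatch can also be checked numerically. The sketch below uses the two-sample Kolmogorov-Smirnov test from SciPy, which is not what the original post used; the function name `shifted_features`, the threshold `alpha` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def shifted_features(train, test, alpha=0.01):
    """Return the columns whose train and test samples differ
    according to a two-sample Kolmogorov-Smirnov test."""
    flagged = []
    for name in train:  # works for a dict of arrays or a DataFrame
        _, p_value = stats.ks_2samp(train[name], test[name])
        if p_value < alpha:
            flagged.append(name)
    return flagged

# Synthetic demo: one column with matching distributions, one with a shifted mean.
rng = np.random.default_rng(0)
train_demo = {"same": rng.normal(0, 1, 2000), "shifted": rng.normal(0, 1, 2000)}
test_demo = {"same": rng.normal(0, 1, 2000), "shifted": rng.normal(1.5, 1, 2000)}
flagged = shifted_features(train_demo, test_demo)
print(flagged)
```

With the competition data, the call would be something like `shifted_features(df.drop(columns='target'), test)`. With thousands of rows the test is very sensitive, so the flagged list is a starting point for inspection rather than a final selection.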
2. Data correlation
correlation_matrix=df.corr()
plt.figure(figsize=(12,10))
sns.heatmap(correlation_matrix,vmax=0.9,linewidths=0.05,cmap="RdGy")
Some variables have a high correlation with the target variable. Select some variables to draw a scatter plot with the target variable.
col=['V0', 'V1','V2', 'V3', 'V5', 'V11', 'V24','V37','target']
sns.pairplot(df[col],hue='target',kind='scatter')
Since the target variable is continuous, the plots are less intuitive than they would be with a discrete target, but a linear trend is still visible.
3. QQ plot and Box-Cox transformation
Some variables do not follow a normal distribution. The Box-Cox transformation can bring them closer to normal, which can help modeling and improve model performance. However, I did not use the transformed data for modeling; the difference in distribution after the transformation is shown here for reference only.
from sklearn.preprocessing import MinMaxScaler
from scipy import stats

# Min-max normalization
cols_numeric = list(test.columns)

def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

train_data_process = df[cols_numeric].apply(scale_minmax, axis=0)
test_data_process = test[cols_numeric].apply(scale_minmax, axis=0)

# Box-Cox transformation (the +1 shift keeps every value strictly positive)
for col in test.columns:
    train_data_process.loc[:, col], _ = stats.boxcox(train_data_process.loc[:, col] + 1)
    test_data_process.loc[:, col], _ = stats.boxcox(test_data_process.loc[:, col] + 1)

# Histogram and Q-Q plot for each variable
plt.figure(figsize=(30, 40))
j = 1
for col in test.columns:
    ax = plt.subplot(14, 6, j)
    sns.distplot(test_data_process[col], fit=stats.norm)
    ax.set_xlabel(col)
    j += 1
    ax = plt.subplot(14, 6, j)
    stats.probplot(test_data_process[col], dist=stats.norm, plot=plt)
    j += 1
plt.subplots_adjust(wspace=0.3, hspace=0.5)  # adjust the spacing between subplots
plt.show()
Figure: Q-Q plots of the training data before the transformation.
Figure: Q-Q plots of the training data after the transformation.
Through comparison, it can be seen that the data after box-cox transformation is more consistent with the normal distribution.
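To make the effect concrete without the figures, here is a small sketch on synthetic data, assuming an exponential sample as the skewed input (not competition data). It compares the skewness before and after `scipy.stats.boxcox`. Box-Cox requires strictly positive input, which is why the code above min-max scales each column and adds 1 first.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample; the small offset keeps it strictly positive.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=5000) + 1e-3

# boxcox returns the transformed data and the fitted lambda.
transformed, lam = stats.boxcox(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print(f"lambda={lam:.3f}, skew before={skew_before:.3f}, after={skew_after:.3f}")
```

A skewness close to zero after the transformation is consistent with the points lying closer to the straight line in a Q-Q plot.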
3. Feature processing
Feature processing is mainly done for catboost and lightgbm. Random forest is trained directly on the original data, and linear regression is trained on the original data with some variables deleted.
For ensemble (tree-based) models, build as many features as possible to improve accuracy.
For linear regression, deleting the variables whose train and test distributions are inconsistent helps avoid overfitting.
1. catboost and lightgbm feature processing
a. Feature crossover
num_cols = [0,1,2,3,10,12,8]
for index, value in enumerate(num_cols):
    for j in num_cols[index+1:]:
        df['new'+str(value)+'+'+str(j)] = df['V'+str(value)] + df['V'+str(j)]
        test['new'+str(value)+'+'+str(j)] = test['V'+str(value)] + test['V'+str(j)]
num_cols = [0,1,2,3,16,31]
for index, value in enumerate(num_cols):
    for j in num_cols[index+1:]:
        df['new'+str(value)+'+'+str(j)] = df['V'+str(value)] + df['V'+str(j)]
        test['new'+str(value)+'+'+str(j)] = test['V'+str(value)] + test['V'+str(j)]
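The two loops above add the sum of every pair of the listed variables as a new column. The same idea can be written as a small helper so that the training and test sets are guaranteed to get identical columns. This is a sketch; the function name `add_pairwise_sums` and the toy DataFrame are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

def add_pairwise_sums(frame, indices, prefix="new"):
    """Return a copy of `frame` with one 'new<i>+<j>' column per pair of V<i>, V<j>."""
    out = frame.copy()
    for i, j in combinations(indices, 2):
        out[f"{prefix}{i}+{j}"] = out[f"V{i}"] + out[f"V{j}"]
    return out

# Toy data with the same column naming as the competition files.
demo = pd.DataFrame({"V0": [1.0, 2.0], "V1": [10.0, 20.0], "V2": [100.0, 200.0]})
crossed = add_pairwise_sums(demo, [0, 1, 2])
print(list(crossed.columns))
```

Calling the helper once on `df` and once on `test` with the same index list would reproduce the columns created by the loops above.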
b. Average encoding
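# Separate features and target
X = df.drop(columns='target')
Y = df['target']

# Mean encoding
import Meancoder  # helper module, not shown in this post
class_list = ['V0','V1','V2','V3']
MeanEnocodeFeature = class_list
ME = Meancoder.MeanEncoder(MeanEnocodeFeature, target_type='regression')  # create the mean encoder
X = ME.fit_transform(X, Y)  # fit on the training X and y, then encode
test = ME.transform(test)  # encode the test set, which has the same cross features as X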
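The `Meancoder` module itself is not shown, so its exact behavior is unknown. As a rough illustration of the idea only, the sketch below bins a continuous feature into quantile bins learned on the training data and replaces each value with the mean target of its bin. The function `mean_encode`, the binning step and the toy data are all assumptions; the real module may differ, for example by adding smoothing or cross-fitting.

```python
import numpy as np
import pandas as pd

def mean_encode(train_col, target, test_col, n_bins=10):
    """Replace each value by the mean target of its quantile bin (bins fit on train)."""
    edges = np.unique(np.quantile(train_col, np.linspace(0, 1, n_bins + 1)))
    inner = edges[1:-1]  # interior edges; values outside fall into the end bins
    train_bins = np.digitize(train_col, inner)
    test_bins = np.digitize(test_col, inner)
    global_mean = float(np.mean(target))
    bin_means = pd.Series(np.asarray(target, dtype=float)).groupby(train_bins).mean()
    train_enc = pd.Series(train_bins).map(bin_means).fillna(global_mean).to_numpy()
    test_enc = pd.Series(test_bins).map(bin_means).fillna(global_mean).to_numpy()
    return train_enc, test_enc

# Toy data: the target is exactly twice the feature.
train_col = np.arange(100.0)
target = 2 * train_col
test_col = np.array([0.0, 99.0])
train_enc, test_enc = mean_encode(train_col, target, test_col, n_bins=4)
print(test_enc)
```

Note that this simple version encodes the training rows with their own targets, which leaks label information; computing the bin means out of fold is a common way to reduce that.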
2. Linear feature processing
# Delete the variables whose train and test distributions are inconsistent
df.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27'],axis=1,inplace=True)
test.drop(['V2','V5','V9','V11','V13','V14','V17','V19','V20','V21','V22','V27'],axis=1,inplace=True)
Note: each model uses its own feature set, so the models are run separately to obtain their own results.
4. Build a model
1. catboost+lightgbm+5KFold
from catboost import CatBoostRegressor
from lightgbm.sklearn import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Split into training and validation sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Model definitions
clf = CatBoostRegressor(loss_function="MAE",
                        eval_metric='R2',
                        task_type="CPU",
                        od_type="Iter",  # overfitting detector type
                        depth=7,
                        learning_rate=0.02,
                        iterations=5000,
                        random_seed=2022)
gbm = LGBMRegressor(n_estimators=5000, learning_rate=0.02, boosting_type='gbdt',
                    objective='regression_l1',
                    max_depth=-1,
                    random_state=2022,
                    metric='mse')

# 5-fold training
result = []
mean_score = 0
result1 = []
mean_score1 = 0
n_folds = 5
kf = KFold(n_splits=n_folds, shuffle=True, random_state=2022)
for train_index, test_index in kf.split(X):
    x_train = X.iloc[train_index]
    y_train = Y.iloc[train_index]
    x_test = X.iloc[test_index]
    y_test = Y.iloc[test_index]
    # catboost training
    clf.fit(x_train, y_train, verbose=5000)
    y_pred = clf.predict(x_test)
    print('Validation MSE: {}'.format(mean_squared_error(y_test, y_pred)))
    mean_score += mean_squared_error(y_test, y_pred) / n_folds
    y_pred_test = clf.predict(test)
    result.append(y_pred_test)
    # lightgbm training
    gbm.fit(x_train, y_train)
    y_pred1 = gbm.predict(x_test)
    print('Validation MSE: {}'.format(mean_squared_error(y_test, y_pred1)))
    mean_score1 += mean_squared_error(y_test, y_pred1) / n_folds
    y_pred_final1 = gbm.predict(test, num_iteration=gbm.best_iteration_)
    y_pred_test1 = y_pred_final1
    result1.append(y_pred_test1)

# Model evaluation: mean validation MSE and fold-averaged test predictions
print('mean validation MSE (catboost): {}'.format(mean_score))
cat_pre = sum(result) / n_folds
np.savetxt('/test.txt', cat_pre)
print('mean validation MSE (lightgbm): {}'.format(mean_score1))
cat_pre1 = sum(result1) / n_folds
np.savetxt('/test_gbm.txt', cat_pre1)
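The loop above follows a common pattern: fit on each fold, score on the held-out part, predict the test set, then average the per-fold test predictions. Stripped of the model specifics, the pattern can be sketched as below. The helper name `kfold_average` is an assumption, and `LinearRegression` on synthetic data stands in for catboost and lightgbm only to keep the example light.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def kfold_average(model, X, y, X_test, n_splits=5, seed=2022):
    """Return the mean validation MSE and the fold-averaged test prediction."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_mse, test_preds = [], []
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[valid_idx], model.predict(X[valid_idx])))
        test_preds.append(model.predict(X_test))
    return float(np.mean(fold_mse)), np.mean(test_preds, axis=0)

# Synthetic regression data (not competition data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(10, 3))
cv_mse, avg_pred = kfold_average(LinearRegression(), X_demo, y_demo, X_new)
print(cv_mse, avg_pred.shape)
```

The helper indexes NumPy arrays directly; with DataFrames, as in the post, the indexing would use `.iloc` instead.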
2. linear+RandomForest
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=800, max_features='sqrt', random_state=0)
lr.fit(x_train, y_train)
rf.fit(x_train, y_train)
linear_predict = lr.predict(x_test)
rf_predict = rf.predict(x_test)

# Model evaluation
print('LR_MSE:', mean_squared_error(y_test, linear_predict))
print('rf_MSE:', mean_squared_error(y_test, rf_predict))

# Predict on the test set
pre_lr = lr.predict(test)
pre_rf = rf.predict(test)

# Save the predictions
np.savetxt('/pre_lr.txt', pre_lr)
np.savetxt('/pre_rf.txt', pre_rf)
Note: the linear model has to be retrained on the data with the inconsistent variables deleted, which sets it apart from catboost and lightgbm.
RandomForest is trained on the original data, without added features or deleted variables.
5. Model fusion
The key to model fusion is the choice of weights. Different model combinations and different weights give noticeably different online scores:
| Fusion | Online score (MSE) |
| --- | --- |
| catboost+lightgbm+5KFold | 0.1253 |
| catboost+lightgbm+Linear+RandomForest | 0.1181 |
| catboost+lightgbm+5KFold+Linear+RandomForest | 0.1146 |
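1. catboost+lightgbm simple weighted fusion
# Weighted model fusion: the model with the lower validation MSE gets the larger weight
sub_Weighted = ((1 - mean_score1/(mean_score1+mean_score)) * cat_pre1
                + (1 - mean_score/(mean_score1+mean_score)) * cat_pre)
This fusion only reaches an online score of 0.1253.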
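The weighting formula above gives each model the other model's share of the total error, so the two weights sum to 1 and the model with the lower MSE counts more. A worked sketch with made-up numbers follows; the function name `error_weighted_blend` and all values are illustrative.

```python
import numpy as np

def error_weighted_blend(pred_a, pred_b, mse_a, mse_b):
    """Blend two predictions with weights 1 - mse / (mse_a + mse_b)."""
    total = mse_a + mse_b
    w_a = 1 - mse_a / total
    w_b = 1 - mse_b / total
    blend = w_a * np.asarray(pred_a, dtype=float) + w_b * np.asarray(pred_b, dtype=float)
    return blend, (w_a, w_b)

# Model A has one third of model B's error, so it gets three times the weight.
blend, (w_a, w_b) = error_weighted_blend([1.0, 2.0], [3.0, 4.0], mse_a=0.10, mse_b=0.30)
print(w_a, w_b, blend)
```

When the two validation errors are close, as is typical for two gradient boosting models on the same features, both weights end up near 0.5 and the fusion is close to a plain average.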
2. Model weight screening
def model_mix(pred, pred1, pred2, pred3):
    # Try every integer weight combination and record the validation MSE of the blend
    rows = []
    for a in range(10):
        for b in range(10):
            for c in range(10):
                for d in range(1, 10):
                    y_pred3 = (a*pred + b*pred1 + c*pred2 + d*pred3) / (a+b+c+d)
                    mse = mean_squared_error(y_test, y_pred3)
                    rows.append({'CatBoostRegressor': a,
                                 'LGBMRegressor': b,
                                 'Linear': c,
                                 'RandomForest': d,
                                 'Combine': mse})
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
    return pd.DataFrame(rows, columns=['CatBoostRegressor','LGBMRegressor','Linear','RandomForest','Combine'])

model_combine = model_mix(y_pred, y_pred1, linear_predict, rf_predict)
model_combine.sort_values(by='Combine', inplace=True)
model_combine.head()
Note that for this step catboost and lightgbm must be trained the same way as linear and RandomForest, on a single train/validation split, so that all four predictions refer to the same y_test.
On the validation set the best weight combination was 9, 9, 4, 1. After final testing, the weight coefficients used were Linear: 9, RandomForest: 9, CatBoostRegressor: 9, LGBMRegressor: 1. This fusion scores 0.1181 online.
The weight screening of the validation set can provide a reference for the model weight and is a good screening method.
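The screening idea generalizes to any number of models. The sketch below is a compact variant that uses `itertools.product` instead of nested loops and returns only the best combination. The function name `search_weights` and the toy data are assumptions; unlike `model_mix`, it allows any weight to be zero and skips only the all-zero combination.

```python
from itertools import product

import numpy as np

def search_weights(preds, y_true, max_weight=9):
    """Grid-search integer weights 0..max_weight that minimize the blend's MSE."""
    preds = np.asarray(preds, dtype=float)  # shape (n_models, n_samples)
    y_true = np.asarray(y_true, dtype=float)
    best_mse, best_w = np.inf, None
    for w in product(range(max_weight + 1), repeat=len(preds)):
        total = sum(w)
        if total == 0:  # all-zero weights cannot be normalized
            continue
        blend = np.dot(w, preds) / total
        mse = float(np.mean((blend - y_true) ** 2))
        if mse < best_mse:
            best_mse, best_w = mse, w
    return best_w, best_mse

# Toy example: one model over-predicts by 1, the other under-predicts by 1.
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
best_w, best_mse = search_weights([y_demo + 1, y_demo - 1], y_demo)
print(best_w, best_mse)
```

Because the weights are tuned on the validation set, the selected combination can overfit that split, which is one reason the final weights above were adjusted after online testing.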
3. Model weighted fusion with new weights
# Load the saved predictions of the four models
with open("/pre_lr.txt") as fr:
    df_lr = pd.read_table(fr, header=None, sep="\t")
with open("/pre_rf.txt") as fr:
    df_rf = pd.read_table(fr, header=None, sep="\t")
with open("/test.txt") as fr:
    df_test = pd.read_table(fr, header=None, sep="\t")
with open("/test_gbm.txt") as fr:  # file name as written in the training step
    df_test_gbm = pd.read_table(fr, header=None, sep="\t")

# Weighted average; the weights sum to 28
mix_predict = (9*df_lr + 9*df_rf + 9*df_test + 1*df_test_gbm) / 28
np.savetxt('/four_mix_4predict.txt', mix_predict)
6. Online scores and rankings
Summary
1. For ensemble (tree-based) models, keep the data as complete as possible. For linear models, delete the features whose train and test distributions are inconsistent; otherwise the model overfits easily.
2. The effect of feature engineering varies with the model and with the random seed used for splitting the data.
3. The Box-Cox transformation was not used for modeling here and needs further study.
4. Model fusion clearly improves the online score, and there are many ways to do it.